Comp578                                                                                                                     Susan Portugal Fall 2008

Assignment 2                                                                     September 11, 2008

You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction.  He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before.  I just keep track of the number of customer complaints for each product.  I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute.  But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless.  I think that he was just mad because our bestselling product had the worst satisfaction since it had the most complaints.  Could you help me set him straight?"

(a) Who is right, the marketing director or his boss?  If you answered, his boss, what would you do to fix the measure of satisfaction?

The boss is correct: the marketing director has overlooked the obvious.  A raw count of complaints is a misleading measure of satisfaction because it does not take into account the number of units sold, so a best-selling product will tend to collect the most complaints simply because it has the most customers.  To fix the measure, divide the number of complaints filed for each product by the number of units of that product sold.

To determine which product has the worst satisfaction, compare the complaint rates, that is, the number of complaints divided by the number of units sold, expressed as a percentage.  Another consideration is the minimum number of units that must be sold before the rate is reliable.  For example, suppose a store sold 100 units of product x and 2 units of product y, and received 30 complaints for x and 1 complaint for y.  The complaint rates are then 30% for x and 50% for y.  Looking only at the rates, the boss would rush to fix the product with the 50% complaint rate, yet only 2 units of that product were sold and the severity of the complaint is unknown.  Therefore a minimum number of units sold should be required before a product’s complaint rate is included in the analysis.
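The calculation above can be sketched in a few lines of Python.  This is only an illustration: the function name `complaint_rates` and the threshold `min_units=30` are assumptions made for the example, not values given in the exercise.

```python
def complaint_rates(sales, complaints, min_units=30):
    """Return complaints per unit sold for each product.

    Products with fewer than `min_units` sales map to None, because a
    rate computed from so few sales is too noisy to rank.
    `min_units=30` is an arbitrary, illustrative threshold.
    """
    rates = {}
    for product, units in sales.items():
        if units < min_units:
            rates[product] = None
        else:
            rates[product] = complaints.get(product, 0) / units
    return rates


# The two products from the example above.
sales = {"x": 100, "y": 2}
complaints = {"x": 30, "y": 1}

rates = complaint_rates(sales, complaints)
print(rates)  # x has a 30% complaint rate; y is withheld (only 2 units sold)
```

With the threshold in place, product y is reported as unranked rather than as the worst product.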

(b) What can you say about the attribute type of the original product satisfaction attribute?

The marketing director is correct that a count of complaints is a ratio attribute.  However, the counts are not comparable across products as a measure of satisfaction, because each count is based on a different number of units sold, which biases the comparison.  This is like taking a set of temperatures measured in Celsius, Kelvin, and Fahrenheit and reporting the raw numbers without first converting them to one common scale.

A few months later, you are again approached by the same marketing director as in Exercise 3.  This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products.  He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer.  Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference.  However, our test subjects are very indecisive, especially when there are more than two products.  As a result, testing takes forever.  I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings.  Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1.  Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results.  And my boss wants the latest product evaluations, yesterday.  I should also mention that he was the person who came up with the old product evaluation approach.  Can you help me?"

(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference?  Explain.

The marketing director is in trouble because his procedure does not guarantee a consistent data set.  Pairwise comparisons, especially when made by different testers, need not agree on a single top-ranked product.  When comparing samples 1 and 2, tester 1 may prefer sample 2.  When comparing samples 2 and 3, tester 2 may prefer sample 3.  When the last tester compares samples 3 and 1, that tester may prefer sample 1.  Each sample is then preferred in one comparison and rejected in another, so the results form a cycle and no ordinal ranking can be drawn from the test.
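One way to check whether a set of pairwise results can be turned into a ranking is to try every possible ordering.  The sketch below assumes hypothetical outcomes (2 over 1, 3 over 2, and either 1 over 3 or 3 over 1); the function name `consistent_rankings` is made up for the illustration, and brute force is only practical for a handful of product variations.

```python
from itertools import permutations


def consistent_rankings(items, prefers):
    """Return every ordering (best first) that agrees with all pairwise results.

    `prefers` is a set of (winner, loser) pairs.
    """
    result = []
    for order in permutations(items):
        position = {item: i for i, item in enumerate(order)}
        if all(position[winner] < position[loser] for winner, loser in prefers):
            result.append(order)
    return result


# Hypothetical outcomes: 2 beats 1, 3 beats 2, 1 beats 3 (a cycle).
cyclic = {(2, 1), (3, 2), (1, 3)}
# Same first two outcomes, but 3 beats 1 (transitive).
transitive = {(2, 1), (3, 2), (3, 1)}

print(consistent_rankings([1, 2, 3], cyclic))      # []
print(consistent_rankings([1, 2, 3], transitive))  # [(3, 2, 1)]
```

The cyclic outcomes admit no ordering at all, while the transitive outcomes give exactly one ranking.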

(b) Is there a way to fix the marketing director's approach?  More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?

There are different ways to fix the marketing director’s approach while still saving time.  One way to reduce conflicting data is to give the testers guidelines on what to look for when testing a product.  For example, if the product under test is a granola bar, then the following aspects can be ranked: visual appeal, flavor, chewiness, health appeal, etc.  By evaluating more aspects of the product, the analysis gives a more thorough evaluation of the top-rated product.  This method also gives insight into why one product ranks as highly satisfactory overall, and can indicate which new products may also sell well when they come onto the market.

(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects.  Comment on whether you think that this is a reasonable approach.  What other approaches might you take?

The original measurement scheme can be an effective way to find the top-ranked product when 3 products are compared.  However, taking the mean of the rankings may leave some products tied and may not represent the data properly.  If each tester instead rates each product on a ten-point scale, the ratings can be averaged to find which product ranks highest, or whether the products are viewed as equal (Product A: 5, Product B: 5, Product C: 5).  Ratings may also reveal whether a tester truly feels that one product is comparable to another, and whether a product is highly satisfactory or merely average overall.
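The tie problem can be illustrated with a small sketch.  The rankings below are hypothetical data invented for the example (1 = most preferred), and counting first-place votes is just one possible alternative summary, not the method prescribed by the exercise.

```python
from statistics import mean, median

# Hypothetical rankings: one dict per test subject, 1 = most preferred.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 3, "B": 2, "C": 1},
    {"A": 1, "B": 2, "C": 3},
    {"A": 3, "B": 2, "C": 1},
]


def summarize(rankings):
    """Return mean rank, median rank, and first-place count per product."""
    summary = {}
    for product in sorted(rankings[0]):
        ranks = [subject[product] for subject in rankings]
        summary[product] = {
            "mean": mean(ranks),
            "median": median(ranks),
            "first_place": ranks.count(1),
        }
    return summary


for product, stats in summarize(rankings).items():
    print(product, stats)
```

All three products have a mean rank of 2, so the average reports a three-way tie, yet the first-place counts show that the subjects are split between A and C while nobody prefers B.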

Can you think of a situation in which identification numbers would be useful for prediction?

One situation where identification numbers would be useful for prediction is public transportation.  With identification numbers, the public transportation system could predict the minimum revenue accrued each month.  Not all customers use public transportation every day of the week.  For a student who has class only three days a week, the identification number can be used to predict that rider’s contribution to future revenue.  Identification numbers can show how many riders use public transportation frequently and how many use it only sporadically.  Identification numbers can also track the routes of riders who get off a bus to transfer to a train.  This can help in planning improvements, by predicting how many riders would use a new route if it were more direct.

Another useful piece of information obtained from identification numbers is how many people in a county have ever used public transportation.  When the identification numbers are given out in order, then at any point in time a count of how many people have used the transportation system can be obtained.  This can also help count how many new people start using public transportation each year, in order to predict future growth.

The following attributes are measured for members of a herd of Asian elephants: weight, height, trunk length, and ear area.  Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants?  Justify your answer and explain any special circumstances.

The Asian elephant attributes listed above are ratio attributes.  Each attribute is measured on a different scale: for example, weight is measured in grams, height and trunk length in meters, and ear area in square meters.  The Mahalanobis distance accounts for these differences in scale, and for correlation among the attributes, by normalizing with the covariance matrix.  It measures the distance between two objects x and y, written as row vectors:

$\mathrm{Mahalanobis}(x, y) = (x - y)\,\Sigma^{-1}\,(x - y)^{T}$

x = interest point one

y = interest point two

$\Sigma^{-1}$ = inverse of the covariance matrix
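The formula above can be sketched with NumPy as follows.  The herd measurements are hypothetical values invented for the illustration, and the helper name `mahalanobis` is an assumption.  As in the formula, no square root is taken.

```python
import numpy as np


def mahalanobis(x, y, cov):
    """Return (x - y) Sigma^{-1} (x - y)^T for row vectors x and y."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    return float(diff @ np.linalg.solve(cov, diff))


# Sanity check: with an identity covariance matrix the result is the
# squared Euclidean distance.
print(mahalanobis([0, 0], [3, 4], np.eye(2)))  # 25.0

# Hypothetical herd: weight (kg), height (m), trunk length (m), ear area (m^2).
herd = np.array([
    [2700.0, 2.5, 1.8, 0.55],
    [3100.0, 2.7, 1.9, 0.60],
    [4000.0, 3.0, 2.1, 0.70],
    [3600.0, 2.8, 2.0, 0.68],
    [2900.0, 2.6, 1.7, 0.58],
    [4400.0, 3.1, 2.2, 0.75],
])

# Covariance of the attributes (columns), estimated from the herd itself.
cov = np.cov(herd, rowvar=False)
print(mahalanobis(herd[0], herd[2], cov))
```

Because the covariance matrix is estimated from the herd, the large numeric range of weight does not dominate the result the way it would in a plain Euclidean distance.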

You are given a set of m objects that is divided into K groups, where the ith group is of size m_i.  If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

(a) We randomly select n * m_i/m elements from each group. (b) We randomly select n elements from the data set, without regard for the group to which an object belongs.

Sampling scheme (a) is stratified: each group contributes to the sample in proportion to its size, so the sample represents every group proportionally.  In scheme (b), the number of objects drawn from each group varies randomly from sample to sample and only approximates those proportions.  Scheme (b) does not guarantee that every group is represented, because no matter how many objects are randomly selected, it is not certain that at least one comes from each group.
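The two schemes can be compared with a short simulation.  This is a sketch under assumptions: the group names and sizes (60, 30, 10) are hypothetical, and rounding n * m_i/m to the nearest integer is a choice made here because the exercise does not say how to handle fractional group quotas.

```python
import random
from collections import Counter


def stratified_sample(groups, n, rng):
    """Scheme (a): draw round(n * m_i / m) objects, with replacement, per group."""
    m = sum(len(members) for members in groups.values())
    sample = []
    for members in groups.values():
        k = round(n * len(members) / m)
        sample.extend(rng.choices(members, k=k))
    return sample


def simple_random_sample(groups, n, rng):
    """Scheme (b): draw n objects, with replacement, ignoring group membership."""
    pool = [obj for members in groups.values() for obj in members]
    return rng.choices(pool, k=n)


# Hypothetical data: each object is a (group, index) pair; m = 100.
groups = {
    "a": [("a", i) for i in range(60)],
    "b": [("b", i) for i in range(30)],
    "c": [("c", i) for i in range(10)],
}

rng = random.Random(0)
n = 10
strat_counts = Counter(group for group, _ in stratified_sample(groups, n, rng))
simple_counts = Counter(group for group, _ in simple_random_sample(groups, n, rng))

print("scheme (a):", dict(strat_counts))   # always 6 from a, 3 from b, 1 from c
print("scheme (b):", dict(simple_counts))  # varies with the random draw
```

Under scheme (b) with these sizes, the chance that group c is missed entirely is (1 - 10/100)^10, roughly 0.35, whereas scheme (a) always includes it.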

Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents.  Consider the variable transformation that is defined by

$tf'_{ij} = tf_{ij} \times \log\frac{m}{df_i}$

where df_i is the number of documents in which the ith term appears, which is known as the document frequency of the term.  This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document?  In every document?

If a term appears in only one document, then its document frequency is 1 and the transformation multiplies its frequency by log(m), the largest weight the transformation can give.  If the term appears in every document, then its document frequency is m and the transformed value is exactly 0, since log(m/m) = log(1) = 0.
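The two cases can be checked numerically.  The sketch assumes the transformation has the usual inverse document frequency form, tf' = tf * log(m / df); the values m = 1000 and tf = 3 are arbitrary choices for the illustration.

```python
import math


def idf_transform(tf, df, m):
    """Return tf * log(m / df), assuming the usual inverse document frequency form."""
    return tf * math.log(m / df)


m = 1000  # hypothetical number of documents
tf = 3    # hypothetical raw term frequency

in_one_document = idf_transform(tf, 1, m)     # df = 1: weight is tf * log(m)
in_half_the_docs = idf_transform(tf, 500, m)  # df = m/2: weight is tf * log(2)
in_every_document = idf_transform(tf, m, m)   # df = m: log(1) = 0

print(in_one_document, in_half_the_docs, in_every_document)
```

The weight shrinks steadily as the term appears in more documents and reaches zero when the term appears in all of them.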

(b) What might be the purpose of this transformation?

The purpose of this transformation is to help when searching for documents about a particular main idea.  In a large database, words that appear in nearly every document do little to distinguish one document from another, so the transformation reduces their weight and emphasizes terms that occur in only a few documents.  The user then finds it easier to locate the documents of interest, because the search does not focus on words that don’t directly relate to the main idea.


Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.  Introduction to Data Mining.  Addison Wesley, 2005.  ISBN 0321321367.