Hello everyone,

I'm a long-time follower and fan of xkcd and was hoping that this community can help me with a problem I (and all my colleagues) are currently having with my/our employer.

You see, they've brilliantly decided to improve customer satisfaction by using 'mystery shopping'. (Instead of using that money to, for example, make sure we can actually be reached by phone, hire more personnel to actually help customers or ensure the prices and availability on our website correspond to the actual situation in our store.)

Every month a single mystery shopper has 3 conversations with the sales assistants in our store, and based on that result the management believes they can judge how well we're doing our jobs.

I'm not an expert in statistics, far from it, but my gut tells me that somehow, this kind of test doesn't really make much sense and that these results are too random to have much significance.

I looked up how to calculate statistical relevance and found out about the T and P and H0 stuff, but can't wrap my head around it... so I was hoping to get some help from all the smart guys that DO actually understand a statistically relevant percentage of xkcd's comics.

I can give some variables.

Our store is open 6 days a week (and soon we'll be open one Sunday every month, too. It might be useful to take that into account in the calculation, since I want to argue how much sense it makes to continue doing these tests). We're open from 10:00 until 18:30, Friday 'til 19:00.

We have about 50 sales assistants (so not management), working 35 hours per week.

It's hard to say how many sales conversations each of these employees has, but judging from myself (which, I know, isn't really relevant, but we have to use some number) I talk to roughly 10 customers every hour. (Some are lengthy conversations, others are simply pointing them in the right direction.)

So, working with these numbers, how statistically relevant is one single mystery shopper having three random faked 'sales conversations' once a month? And how many mystery shopping trips would it take to get meaningful results?

(I briefly considered asking you to factor in the odds of the mystery shopper being in a good or a bad mood, the item he's (pretending to be) interested in being available, the impact of our air conditioning not working - it was, for example, roughly 30 degrees Celsius last month, which really had an impact on both customers and employees - and whatever other completely random factors weigh in, but that's not really possible I guess. So cold hard numbers will have to do. That's how management treats customers and employees anyway.)

Thank you very much for your time and input. I really hope to use these numbers during our next meeting discussing the mystery shopping results.

## Please help with my employer & Statistical Significance


### Re: Please help with my employer & Statistical Significance

To get a statistically relevant sample, you probably do need a lot of trials. When you try to do statistics with small numbers, weird things can happen. For example, say the odds of any one mystery-shopper conversation coming back with a bad report are 1%. Certainly on average, you would expect that there should be no problems. But you can run into difficulty if you try to extrapolate... suppose that two months in a row, one of the three conversations comes back with a bad report. The odds of this are low (with 3 conversations a month, it works out to roughly 0.1%, I think). But if you try to extrapolate from that, your boss might say "One in three customers is having a bad experience, this is a crisis", which is false. 99% are having a good experience, you just got unlucky and caught two outliers.
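For anyone who wants to check the small-number arithmetic, here is a minimal sketch using the binomial distribution. It assumes the 1% bad-report rate from the example, three conversations per month, and that conversations and months are independent of each other; the names `binom_pmf`, `prob_at_least`, `P_BAD` and `PER_MONTH` are just illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k bad reports in n independent conversations."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more bad reports in n conversations."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


P_BAD = 0.01     # assumed chance that any single conversation goes badly
PER_MONTH = 3    # conversations per mystery-shopping visit

# Chance that a given month contains at least one bad report.
one_month = prob_at_least(1, PER_MONTH, P_BAD)

# Chance that two consecutive months each contain at least one bad report,
# assuming the months are independent of each other.
two_months = one_month**2

print(f"at least one bad report in a month: {one_month:.4f}")
print(f"the same two months in a row:       {two_months:.6f}")
```

So even with 99% of customers happy, roughly one month in thirty will show a bad report, and a "one in three" headline from such a month says very little about the true rate.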

Here's a little app that lets you calculate survey sizes based on a given population and the error you want. Suppose you want to estimate the proportion of "bad experiences" with an error of +/-5%. Assuming you have 70,000 transactions a month (50 people * 35 hours/week * 4 weeks/month * 10 transactions/hour), you need a sample size of roughly 380 people to get the right answer (within that error) 95% of the time.
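I don't know what formula the app uses internally, but a common textbook approach is Cochran's sample-size formula with a finite population correction, and it lands in the same ballpark. The sketch below assumes the worst-case proportion p = 0.5 and z = 1.96 for 95% confidence; `sample_size` and the population sizes in the loop are illustrative, not taken from the app.

```python
from math import ceil


def sample_size(population: int, margin: float = 0.05,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion within +/- margin.

    Uses Cochran's formula for an infinite population, then applies the
    finite population correction. p = 0.5 is the most conservative choice.
    """
    n0 = z**2 * p * (1 - p) / margin**2        # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)       # finite population correction
    return ceil(n)


# Illustrative monthly conversation counts; the answer barely moves.
for population in (10_000, 50_000, 100_000):
    print(population, sample_size(population))
```

The point to notice is that the required sample stays at a few hundred whether the store has ten thousand or a hundred thousand conversations a month, which is a long way from three.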

### Re: Please help with my employer & Statistical Significance

I'd suspect that this is less about doing a statistical trial of how well your customers are being treated, and more about scaring employees into treating customers better, since any one of them could be the "mystery customer" and potentially call them out if they do a crappy job.

### Re: Please help with my employer & Statistical Significance

For what it's worth, estimating the exact probability of failure takes quite a few samples, as LaserGuy points out. However, if you have high expectations, then a test like this can say enough. Suppose you go to a restaurant which costs at least 600 dollars for an entire family, and you never want them to have a poor experience. One or two bad reviews can be very serious (serious enough to fire whoever was responsible for the transgression).

I'm not great at statistics, but I think basic hypothesis testing is all you need to understand this. Suppose you want your service to work well at least 95 percent of the time. Consider 2 months before review (I'm assuming a really small sample to make the math easier). Assuming your service works exactly 95 percent of the time, the probability of getting 0 or 1 failures out of 6 tests is (.95)^6 * (.05)^0 * (6 choose 0) + (.95)^5 * (.05)^1 * (6 choose 1) = 0.9672, and it is higher still if the service is better than that. Exactly two failures has a probability of about 0.0305, and each additional failure is less likely still.

I throw out the claim that the service works at least 95 percent of the time if the observed result would have a probability below 0.05 under that claim. Assuming the service works 95 percent of the time, the probability that I incorrectly decide the claim is false is just the probability of getting 2 or more bad conversations, which is 1 - 0.9672 = 0.0328. This number is only kind of small (though I wouldn't bet on the service being good more than 95 percent of the time).
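If you want to reproduce those numbers, here is a small sketch of the same calculation. It assumes 6 independent test conversations and a failure rate of exactly 5% (the boundary of the claim being tested); the variable names are mine.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures in n independent tests."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_TESTS = 6      # two months of three conversations each
P_FAIL = 0.05    # failure rate if the service is good exactly 95% of the time

# Probability of seeing 0 or 1 failures when the claim is (just barely) true.
p_at_most_one = binom_pmf(0, N_TESTS, P_FAIL) + binom_pmf(1, N_TESTS, P_FAIL)

# Probability of seeing exactly 2 failures.
p_exactly_two = binom_pmf(2, N_TESTS, P_FAIL)

# Probability of wrongly rejecting a true claim with the rule
# "reject if 2 or more conversations go badly".
p_two_or_more = 1 - p_at_most_one

print(f"P(0 or 1 failures) = {p_at_most_one:.4f}")
print(f"P(exactly 2)       = {p_exactly_two:.4f}")
print(f"P(2 or more)       = {p_two_or_more:.4f}")
```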

Now I wanted to work through this to emphasize that one can gather an enormous amount of information from relatively few samples, depending on the question.

All this being said, a professional reviewer is going to have a ton more information at their disposal. Consider a reviewer investigating a restaurant. They ask an employee for an opinion on the difference between wines; if the employee can describe the qualitative difference between wines from two regions, that's impressive. They can observe whether there are too many employees, or whether the restaurant is handling a large number of people well. They can intentionally harass the employee to see how stable and responsive they are. The point is, the situation is far more complicated than some binary sort of thing. But this complexity can give a human an enormous amount of information to draw conclusions from. That is, I wouldn't be surprised if a really good reviewer could have a solid sense of the quality of a restaurant with even less information than what these reviewers are going to be given. And this information really isn't even that statistical in nature (in the sense that they aren't estimating the quantity of various "things" in a larger population of "things").

Honestly, assuming everyone involved isn't stupid, I don't think this review system is a bad idea.

### Re: Please help with my employer & Statistical Significance

polymer wrote:All this being said, a professional reviewer is going to have a ton more information at their disposal. Consider a reviewer investigating a restaurant. They ask an employee for an opinion on the difference between wines; if the employee can describe the qualitative difference between wines from two regions, that's impressive. They can observe whether there are too many employees, or whether the restaurant is handling a large number of people well. They can intentionally harass the employee to see how stable and responsive they are. The point is, the situation is far more complicated than some binary sort of thing. But this complexity can give a human an enormous amount of information to draw conclusions from. That is, I wouldn't be surprised if a really good reviewer could have a solid sense of the quality of a restaurant with even less information than what these reviewers are going to be given. And this information really isn't even that statistical in nature (in the sense that they aren't estimating the quantity of various "things" in a larger population of "things").

Honestly, assuming everyone involved isn't stupid, I don't think this review system is a bad idea.

FWIW, it sounds to me like the OP is talking more about a generic retail rather than a restaurant, based on the employee numbers. I can imagine what you're describing working quite well for a restaurant where you're there for a longer period of time and have multiple interactions with at least one, if not several, employees. I don't know that you could be as thorough in retail.

### Re: Please help with my employer & Statistical Significance

LaserGuy wrote:FWIW, it sounds to me like the OP is talking more about a generic retail rather than a restaurant, based on the employee numbers. I can imagine what you're describing working quite well for a restaurant where you're there for a longer period of time and have multiple interactions with at least one, if not several, employees. I don't know that you could be as thorough in retail.

That's fair. Still, one shouldn't underestimate an honest third-party opinion.

### Re: Please help with my employer & Statistical Significance

Reminds me of this.

http://xkcd.com/651/

Sometimes stuff isn't about rationality, it's about power and rules. Sucks to be you, but even if you explain that you're right, you won't get any better treatment.

### Re: Please help with my employer & Statistical Significance

Oh, I have no illusions about them stopping this or treating us better.

Consider it stress relief or therapy or something.

Every once in a while, they'll have a meeting to discuss these numbers (read: a monologue with Excel-exported PowerPoint slides where all good numbers are ignored and bad numbers get a larger font, a red color and bold) and it kind of feels good to get all that stuff off your chest then, you know?

Pointing out the futility of spending money on this (really, it seems more of a marketing scam from the company doing the mystery tests than anything else) on a somewhat scientific basis and coming up with some constructive ideas to actually improve customer service on a structural level instead of just looking at one completely random result... Everyone knows it'll get ignored by management, but it does give me a chance to vent and it seems to boost morale among us worker drones. Creates some sort of collective 'us against them' feeling.

-- Disclaimer: I'm well aware this isn't a healthy situation for a company to be in and that this is more a psychological thing than anything else, but that's not really important in this discussion. Management only sees numbers, so I want to show them a number and perhaps a formula to get to that number and explain what it means.

### Re: Please help with my employer & Statistical Significance

LaserGuy wrote:FWIW, it sounds to me like the OP is talking more about a generic retail rather than a restaurant, based on the employee numbers. I can imagine what you're describing working quite well for a restaurant where you're there for a longer period of time and have multiple interactions with at least one, if not several, employees. I don't know that you could be as thorough in retail.

I'm sure there is still room for a comprehensive evaluation of the total customer experience. Let's say that we're talking about a big-box electronics retailer. If I'm a mystery shopper, then I can easily observe whether the store is properly lit, whether there are good supplies of the weekly special brochures by the door, whether all of the employees are well-groomed and friendly, whether there are any holes in the shelves for merchandise that hasn't been stocked, and so on.

And statistical significance is an important concept, but this isn't a medical study. Let's say that sales data shows that the Westburg store isn't selling as many big screen televisions as the other stores in the region and the mystery customer program at Westburg shows that sales associate John was found to be uninformed about the differences between two models of big screen televisions. The managers of this store don't need to reject the null hypothesis that John's performance is unrelated to the sales figures with 95% confidence before taking action. Making sure that John is a fully informed and helpful representative is a sensible strategy whether or not it will have a significant impact on sales figures. In the language of statistics (assuming I correctly remember my vocabulary), the acceptable α is relatively high compared with scholarly work, but that's okay because there is little regret from a Type-I error.

### Re: Please help with my employer & Statistical Significance

Tirian wrote:LaserGuy wrote:FWIW, it sounds to me like the OP is talking more about a generic retail rather than a restaurant, based on the employee numbers. I can imagine what you're describing working quite well for a restaurant where you're there for a longer period of time and have multiple interactions with at least one, if not several, employees. I don't know that you could be as thorough in retail.

I'm sure there is still room for a comprehensive evaluation of the total customer experience. Let's say that we're talking about a big-box electronics retailer. If I'm a mystery shopper, then I can easily observe whether the store is properly lit, whether there are good supplies of the weekly special brochures by the door, whether all of the employees are well-groomed and friendly, whether there are any holes in the shelves for merchandise that hasn't been stocked, and so on.

And statistical significance is an important concept, but this isn't a medical study. Let's say that sales data shows that the Westburg store isn't selling as many big screen televisions as the other stores in the region and the mystery customer program at Westburg shows that sales associate John was found to be uninformed about the differences between two models of big screen televisions. The managers of this store don't need to reject the null hypothesis that John's performance is unrelated to the sales figures with 95% confidence before taking action. Making sure that John is a fully informed and helpful representative is a sensible strategy whether or not it will have a significant impact on sales figures. In the language of statistics (assuming I correctly remember my vocabulary), the acceptable α is relatively high compared with scholarly work, but that's okay because there is little regret from a Type-I error.

This is spot on. When done properly, secret shopping is primarily a qualitative research practice. It can't be used to quantify frequency of good/poor interactions, but it can describe and analyze the customer experience in order to find focal points for improvement. That's very difficult to do inexpensively with a quantitative study (and such a study would need to be based on qualitative observations anyway). While an employer may be concerned about the frequency of poor customer interaction, it is much more important to know more detail about such interactions in order to help facilitate better interactions in the future.

At least, that's how good secret shopping works. If the secret shopper is inexperienced or unqualified, or if the data is misused (like trying to estimate the proportion of good/bad interactions, or to generalize a particular employee's interaction with the secret shopper to all of that employee's interactions) then it's likely to do more harm than good. Here's a good, if not exactly unbiased, article about good vs. bad secret shopping:

http://theunsecretshopper.com/2010/06/29/why-secret-shopping-doesnt-work/

E: But yeah, it could also just be about keeping employees on their toes with the secret shopping business, in which case no amount of reasoning is likely to sway them towards good management technique.
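To put a rough number on why three conversations can't quantify how often things go wrong, here is a minimal sketch. It asks: if every one of n mystery conversations goes well, how high could the true bad-interaction rate still be? It uses the exact one-sided binomial bound for the zero-failure case and assumes independent conversations; the function name is made up for illustration.

```python
def max_bad_rate_after_clean_run(n: int, confidence: float = 0.95) -> float:
    """Largest bad-interaction rate still consistent with n clean reports.

    With a true bad rate p, the chance of n clean reports in a row is
    (1 - p)**n. Solving (1 - p)**n = 1 - confidence for p gives the
    one-sided upper bound at the chosen confidence level.
    """
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)


# One monthly visit (3 conversations) versus a full year (36 conversations).
for n in (3, 36):
    print(n, round(max_bad_rate_after_clean_run(n), 3))
```

For a single visit of three conversations the bound comes out at about 0.63, so a perfect score is still compatible with most interactions going badly. That is the sense in which a monthly result works as a qualitative snapshot and not as a frequency estimate.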
