
Is this a known probability distribution?

Posted: Thu Aug 10, 2017 5:38 am UTC
by Ingolifs
Originally posted on stackexchange, but with no answers or even comments, I'm posting it here:

For work, I plotted a set of shops by how many times they did business in the past week. The plot I got out looked something like this:

[Image: distribution.png]


This looked obviously like a Pareto distribution. After all, that's the distribution of earthquakes and rich people. It would make sense in this sort of context. However, when I plotted this graph on log-log axes, instead of a straight line, I got this:

[Image: distributionlog.png]


After further play with the data, I found that this curve could be straightened if I took a root of the log of the y value (in this case, the 3.9th root), and from that I derived a general formula for this and similar distributions I had seen:

p=exp(k√(-a log(x)+b))

where k, a, and b are all positive.

I've done various searches and have looked through lists of named distributions, and haven't come across any distributions that resemble this one. Is this a known distribution?
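
For anyone who wants to try the straightening step on their own data, here is a minimal Python sketch. It assumes one particular reading of the description above: that the n-th root of log(y) is linear in log(x), with x the rank of the shop and y its transaction count. The function names, the candidate grid, and the synthetic data are all made up for illustration and are not the original workplace data.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def straightness(values, n):
    """|r| between log(rank) and the n-th root of log(value).

    Values are ranked from largest to smallest. Every value must be
    greater than 1 so that log(value) is positive and the root is real.
    """
    ordered = sorted(values, reverse=True)
    log_rank = [math.log(r) for r in range(1, len(ordered) + 1)]
    z = [math.log(v) ** (1.0 / n) for v in ordered]
    return abs(pearson(log_rank, z))

def best_root(values, candidates):
    """Candidate root that makes the transformed plot closest to a line."""
    return max(candidates, key=lambda n: straightness(values, n))

# Synthetic example (hypothetical parameters, chosen only for the demo).
a, b, n_true = 0.1, 1.5, 3.9
values = [math.exp((b - a * math.log(r)) ** n_true) for r in range(1, 201)]
candidates = [1.0 + 0.1 * i for i in range(71)]  # 1.0, 1.1, ..., 8.0
n_best = best_root(values, candidates)
print(n_best)
```

A high correlation after a tuned transform only says that the curve can be straightened. It does not by itself identify a named distribution.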

Re: Is this a known probability distribution?

Posted: Fri Aug 11, 2017 12:24 am UTC
by DaBigCheez
What are X and Y in these graphs? Presumably one of them is business-transactions-per-week, but what's the other axis?

This may be way off the mark, but what it reminds me of most strongly is when I've accidentally taken data intended to be used as a scatter plot, sorted it, and plotted it based on its index in the sorted list - which, in my applications, tended to produce very pretty-looking and totally meaningless graphs. Would the data perhaps be more usefully examined as a histogram or box-and-whisker plot or the like, if "transactions per week" is in fact the only 'real' variable, or is there another variable you didn't mention that has the extremely tight correlation?

Re: Is this a known probability distribution?

Posted: Fri Aug 11, 2017 2:25 am UTC
by Ingolifs
DaBigCheez wrote:What are X and Y in these graphs? Presumably one of them is business-transactions-per-week, but what's the other axis?

Y is transactions per week, and X is Order, or Count, or whatever you want to call it. The graphs are recreations of the data, because I'm at home sick and the data is not to leave the workplace in any case.
So yes, this is a real effect and not something I accidentally mashed together. The actual data has a few more lumps in it, but I managed a correlation of 0.996 for my best fit, so there isn't any significant deviation from the graphs I presented.

Re: Is this a known probability distribution?

Posted: Tue Aug 15, 2017 2:46 am UTC
by DeGuerre
Have you tried fitting a log-normal distribution? Without knowing anything, my first hypothesis would be that your data follows Gibrat's Law.

Re: Is this a known probability distribution?

Posted: Wed Aug 16, 2017 9:39 am UTC
by Ingolifs
Yes, I looked at that. On a Log-Log plot, a lognormal distribution will show up as a parabola. This data shows up as an Nth root in the Log-Log plot.
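
The parabola claim can be checked directly from the log-normal density. A short derivation, where \mu and \sigma are the usual parameters (mean and standard deviation of \ln x), and where the statement applies to the density f(x) rather than to a sorted value-versus-rank plot:

```latex
f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right),
\qquad x > 0

\ln f(x) = -\ln x - \ln\!\left(\sigma\sqrt{2\pi}\right)
           - \frac{(\ln x - \mu)^2}{2\sigma^2}

% With u = \ln x this is a quadratic in u:
\ln f = -\frac{1}{2\sigma^2}\,u^2
        + \left(\frac{\mu}{\sigma^2} - 1\right)u
        - \frac{\mu^2}{2\sigma^2} - \ln\!\left(\sigma\sqrt{2\pi}\right)
```

The leading coefficient -1/(2\sigma^2) is negative, so the curve is a downward-opening parabola on log-log axes.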

Re: Is this a known probability distribution?

Posted: Wed Aug 16, 2017 2:36 pm UTC
by SuicideJunkie
DaBigCheez wrote:This may be way off the mark, but what it reminds me of most strongly is when I've accidentally taken data intended to be used as a scatter plot, sorted it, and plotted it based on its index in the sorted list - which, in my applications, tended to produce very pretty-looking and totally meaningless graphs.
I've done that kind of plot intentionally a few times.
I find it is good for deciding on cutoff points - you can visually see step changes in your data, and then quickly pick a point on the near-vertical portion of the step to be the cutoff between two groups (eg plot fault rate, and there tends to be a fuzzy step change between 'good' and 'broken' units with only a few marginal ones on the step itself).
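
One crude way to automate that visual step-picking is to sort the values and put the cutoff in the middle of the largest jump between neighbours. This is only a sketch: the function name and the fault-rate numbers are invented for illustration, and it assumes there is a single dominant step.

```python
def largest_gap_cutoff(values):
    """Midpoint of the largest jump between consecutive sorted values."""
    ordered = sorted(values)
    # Pair each gap with the index of its lower neighbour.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps)
    return (ordered[i] + ordered[i + 1]) / 2

# Made-up fault rates: four 'good' units and two 'broken' ones.
fault_rates = [0.01, 0.02, 0.015, 0.9, 0.95, 0.03]
cutoff = largest_gap_cutoff(fault_rates)
print(cutoff)
```

With marginal units sitting on the step itself, eyeballing the sorted plot is probably still the better tool.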

Re: Is this a known probability distribution?

Posted: Thu Aug 17, 2017 5:36 am UTC
by Derek
Ingolifs wrote:Yes, I looked at that. On a Log-Log plot, a lognormal distribution will show up as a parabola. This data shows up as an Nth root in the Log-Log plot.

If you're looking for a log-normal distribution, you wouldn't apply the log-log transform to this graph of sales-versus-orders. You would apply it to a density curve (x-axis is sales, y-axis is the number of stores with that many sales). I'm not sure of the best way to get a density curve from a set of discrete samples, but if you want to check the log-normal hypothesis, I guess the thing to do would be to take the log of all the sales numbers, find the mean and standard deviation of those logs to get your normal distribution, and then somehow measure how well that distribution fits the actual data.
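
A minimal Python sketch of that check, using only the standard library. The choice of the Kolmogorov-Smirnov distance as the accuracy measure is an assumption made here, not something proposed in the thread, and the function name and synthetic sales figures are invented for the demo.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def lognormal_ks(sales):
    """Fit a normal to log(sales) and return (mu, sigma, KS distance).

    All sales figures must be positive; shops with zero sales would
    have to be dropped or handled separately before taking logs.
    """
    logs = sorted(math.log(s) for s in sales)
    mu, sigma = mean(logs), stdev(logs)
    fitted = NormalDist(mu, sigma)
    n = len(logs)
    # KS distance: largest gap between the empirical CDF (a step
    # function) and the fitted normal CDF, checked on both sides of
    # each step.
    d = max(
        max((i + 1) / n - fitted.cdf(v), fitted.cdf(v) - i / n)
        for i, v in enumerate(logs)
    )
    return mu, sigma, d

# Synthetic stand-in for the real sales numbers (hypothetical parameters).
random.seed(0)
fake_sales = [random.lognormvariate(3.0, 1.0) for _ in range(500)]
mu, sigma, d = lognormal_ks(fake_sales)
print(mu, sigma, d)
```

A small distance suggests the log-normal fit is plausible. Because the mean and standard deviation are estimated from the same data, standard KS critical values would be too lenient, so treat the number as a rough guide rather than a formal test.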