Page 1 of 1

Excel needs to sod off and die

Posted: Thu Oct 15, 2009 11:20 pm UTC
by Atre
I needed to extrapolate intermediate values for some collected data, so I graphed it in Excel (scatter graph) and fitted a 6th order Polynomial to the data (damn good fit) I planned to nick the equation for the polynomial to predict values between my data points... Problem is that it doesn't work, the stated equation for the polynomial doesn't fit by several orders of magnitude!!

Does anyone know how to fix the problem?

PS.Data
Spoiler:
80 0.00402
100 0.004415
120 0.004585
140 0.004585
160 0.004545
180 0.0045
200 0.00444
220 0.004405
240 0.004375
260 0.004355
273.15 0.004334601
280 0.004335766
300 0.00434
350 0.004368
400 0.00437

Re: Excel needs to sod off and die

Posted: Thu Oct 15, 2009 11:30 pm UTC
by achan1058
Can you give me the polynomial, and why are you even trying to fit a deg 6 polynomial to start with?

Re: Excel needs to sod off and die

Posted: Fri Oct 16, 2009 1:02 am UTC
by Kizyr
Even without looking at your data, you sound like you're describing the basic problem with overfitting.

A 6th-order polynomial will naturally be a "damn good fit", because the more orders, the more you're precisely fitting the function to your data. The problem there is that when you try to generalize to new data (such as the spaces between your data points), it won't be a very good fit. KF

Re: Excel needs to sod off and die

Posted: Fri Oct 16, 2009 2:13 am UTC
by achan1058
Interesting looking data. I don't think polynomial fit would work on this thing......

Re: Excel needs to sod off and die

Posted: Fri Oct 16, 2009 2:45 am UTC
by masher
Fix it by doing it properly?

It works for me...
Spoiler:
excel.png
excel.png (17.63 KiB) Viewed 3852 times




@Kizyr: I'd be comfortable using this formula between ~140 and 300. I'd be worried about the edges of the dataset. Assuming we can call the temperature dependence "smooth"

Re: Excel needs to sod off and die

Posted: Fri Oct 16, 2009 7:33 am UTC
by Atre
Interesting, the polynomial you've got is subtly different to mine.... given the magnitude of the power terms that might make a massive difference (seriously, for x =100, looks good on the graph; but the displayed eqt gave the answer of -2000ish when pasted into the spreadsheet. Was not pleased) Cheers for the help, I'll plug and chug with your (presumably) good equation

Re: Excel needs to sod off and die

Posted: Sat Oct 17, 2009 11:08 am UTC
by Atre
Solved Problem: The default display of the equation (as seen in Masher's post) leads to horrible rounding errors. Need to right click equation and change the display to many sig fig and then the equation works.... eejit machine.


In answer to why I'm doing this - as said in first post I need to convert between the quantities that are x & y on that plot. For a huge dataset (several hundred) that is a real pain in the arse to do by hand. An automated equation to do the job would be lovely, a 6th order polynomial doesn't have anything to do with the underlying physics but it matches most of the data well and that is all I require. For where it doesn't fit I have another solution :D ...

Re: Excel needs to sod off and die

Posted: Sat Oct 17, 2009 11:46 am UTC
by Ended
If you're getting frustrated with Excel, there are many other tools which can do a good job at this sort of thing. E.g. in gnuplot (assuming the data is in a file called 'data'):

Code: Select all

f(x) = a + b*x + c*x**2 + d*x**3 + e*x**4 + f*x**5 + g*x**6
fit f(x) 'data' via a,b,c,d,e,f,g
plot f(x), 'data'


Output:
Spoiler:

Code: Select all

After 20 iterations the fit converged.
final sum of squares of residuals : 4.30903e-10
rel. change during last iteration : -8.26587e-08

degrees of freedom    (FIT_NDF)                        : 8
rms of residuals      (FIT_STDFIT) = sqrt(WSSR/ndf)    : 7.33914e-06
variance of residuals (reduced chisquare) = WSSR/ndf   : 5.38629e-11

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

a               = -0.00492909      +/- 0.0005009    (10.16%)
b               = 0.000248624      +/- 1.673e-05    (6.729%)
c               = -2.58813e-06     +/- 2.181e-07    (8.427%)
d               = 1.387e-08        +/- 1.428e-09    (10.3%)
e               = -4.0936e-11      +/- 4.982e-12    (12.17%)
f               = 6.34018e-14      +/- 8.827e-15    (13.92%)
g               = -4.02833e-17     +/- 6.237e-18    (15.48%)

fit.jpg
fit.jpg (27.04 KiB) Viewed 3758 times

Re: Excel needs to sod off and die

Posted: Tue Oct 20, 2009 2:50 pm UTC
by CortoPasta
I needed to extrapolate intermediate values for some collected data, so I graphed it in Excel (scatter graph) and fitted a 6th order Polynomial to the data (damn good fit) I planned to nick the equation for the polynomial to predict values between my data points... Problem is that it doesn't work, the stated equation for the polynomial doesn't fit by several orders of magnitude!!

Does anyone know how to fix the problem?

PS.Data
Spoiler:
Spoiler:
80 0.00402
100 0.004415
120 0.004585
140 0.004585
160 0.004545
180 0.0045
200 0.00444
220 0.004405
240 0.004375
260 0.004355
273.15 0.004334601
280 0.004335766
300 0.00434
350 0.004368
400 0.00437


Usually when my stated equation for a polynomial doesn't fit, it's because I used the dryer on a high heat setting. Next time check the tag and use tumble dry low...

Re: Excel needs to sod off and die

Posted: Tue Oct 20, 2009 11:23 pm UTC
by CrazyIvan
A good rule of thumb is you should only at max one explanatory variable for every 10 observations you have. Which means really, your data can manage a linear fit, and a second order fit is stretching. I've also never, ever seen someone give a plausible reason for a 6-order term besides "well, it fit well". Which is *never* a reason to include it in the model.

Bad statistics are bad. And shortcuts are just massively bad science.

Re: Excel needs to sod off and die

Posted: Sun Oct 25, 2009 11:19 pm UTC
by mouseposture
CrazyIvan wrote:A good rule of thumb is you should only at max one explanatory variable for every 10 observations you have. Which means really, your data can manage a linear fit, and a second order fit is stretching.


Agreed, but

CrazyIvan wrote: I've also never, ever seen someone give a plausible reason for a 6-order term besides "well, it fit well". Which is *never* a reason to include it in the model.


"It fit well" is a fine reason if, like the OP, all you want to do is extrapolate* the data, not understand it.
(But only ceteris paribus.)

*Wrote extrapolate, but apparently means interpolate.

Re: Excel needs to sod off and die

Posted: Wed Oct 28, 2009 1:46 am UTC
by CrazyIvan
mouseposture wrote:"It fit well" is a fine reason if, like the OP, all you want to do is extrapolate* the data, not understand it.
(But only ceteris paribus.)

*Wrote extrapolate, but apparently means interpolate.


Parsimony and model building is admittedly less of a big deal with a purely predictive model, but it still leaves a bad taste in my mouth. Doing anything with your data without a reason why gives me an uncomfortableness.

You get things like the OP's rounding problem as estimates get teeny tiny, a false idea of the precision of your prediction, and honestly, where do you stop? The OP fit a 6th order term. Why not a 7th? An 8th? Has he added negative or fractional power terms? If so, why not?

Those are questions you *should* be able to answer, that I very much doubt the OP can.

Re: Excel needs to sod off and die

Posted: Wed Oct 28, 2009 5:06 pm UTC
by headprogrammingczar
That data looks a lot like a decaying cosine function. Polynomials are never good approximations for trig functions.

Re: Excel needs to sod off and die

Posted: Thu Oct 29, 2009 2:51 am UTC
by mouseposture
CrazyIvan wrote:
Parsimony and model building is admittedly less of a big deal with a purely predictive model, but it still leaves a bad taste in my mouth. Doing anything with your data without a reason why gives me an uncomfortableness.


I think you're assuming the OP is doing science. He may be doing process engineering, in which case flagrant empiricism* is perfectly appropriate. In which case, OP's mistake is not failure to base a model on an understanding of the underlying physical process, but failure to choose a modelling technique (e.g. lowess) that's appropriate when it's not possible, or cost-effective, to gain such understanding.

*Not my cup of tea, but heck, somebody's got to keep those oil-refineries and doughnut-bakeries running.