SciPy - optimization. Finding the ratio between two variables

Time: 2014-03-24 19:16:09

Tags: python scipy best-fit-curve

I have 3 variables: Market_Price, Hours, Age.

Using optimization I have found the relationship between each variable and Market_Price.

Data:

hours =  [1000,  10000,  11000,  11000,  15000,  18000,  37000,  24000,  28000,  28000,  42000,  46000,  50000,  34000,  34000,  46000,  50000,  56000,  64000,  64000,  65000,  80000,  81000,  81000,  44000,  49000,  76000,  76000,  89000,  38000,  80000,  69000,  46000,  47000,  57000,  72000,  77000,  68000]

market_Price =  [30945,  28974,  27989,  27989,  36008,  24780,  22980,  23997,  25957,  27847,  36000,  25588,  23980,  25990,  25990,  28995,  26770,  26488,  24988,  24988,  17574,  12995,  19788,  20488,  19980,  24978,  16000,  16400,  18988,  19980,  18488,  16988,  15000,  15000,  16998,  17499,  15780,  8400]

age =  [2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  7,  8,  8,  8,  8,  8,  13]

The relationships I derived are:

hours to market_price: log(h) * h1 + h2

age to market_price: log(a) * a1 + a2

using SciPy's optimize curve_fit to find h1, h2, a1, a2.
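The two single-variable fits described above can be reproduced with curve_fit along these lines (a sketch, using only the first few data points from the question for brevity):

```python
import numpy as np
from scipy.optimize import curve_fit

# First few data points from the question, for brevity
hours = np.array([1000, 10000, 11000, 11000, 15000, 18000, 37000, 24000])
age = np.array([2, 2, 2, 2, 2, 2, 2, 3])
market_price = np.array([30945, 28974, 27989, 27989, 36008, 24780, 22980, 23997])

def log_model(x, c1, c2):
    # market_price ~ log(x) * c1 + c2
    return np.log(x) * c1 + c2

(h1, h2), _ = curve_fit(log_model, hours, market_price)  # hours fit
(a1, a2), _ = curve_fit(log_model, age, market_price)    # age fit
```

Since the model is linear in c1 and c2 this converges immediately; each call returns the two coefficients of one relationship.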

Now I want to combine all 3 into a single calculation, so that given an age and hours I can determine the market_price.

The way I've been doing it so far is to find the ratio between the two relationships that yields the smallest standard deviation.

std_divs = []
for ratio in ratios:
    price_difference_final = []
    for n in range(len(prices)):
        # Blend the two fitted relationships; note h2/a2 are the
        # intercepts from the fits above (not h1/a1 repeated)
        predicted_price = (log(hours[n])*h1 + h2)*ratio + (log(age[n])*a1 + a2)*(1 - ratio)
        price_difference_final.append(prices[n] - predicted_price)
    data = np.array(price_difference_final)
    std_divs.append(np.std(data))
std_div = min(std_divs)
optimum_ratio = ratios[std_divs.index(std_div)]
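The grid scan above can be replaced by a one-dimensional optimizer such as scipy.optimize.minimize_scalar. A minimal sketch, using a short slice of the data and made-up placeholder values standing in for the fitted h1, h2, a1, a2:

```python
import numpy as np
from scipy.optimize import minimize_scalar

hours = np.array([1000, 10000, 11000, 11000, 15000])
age = np.array([2, 2, 2, 2, 2])
prices = np.array([30945, 28974, 27989, 27989, 36008])

# Placeholder coefficients; in practice these come from the curve_fit step
h1, h2 = -3000.0, 50000.0
a1, a2 = -8000.0, 40000.0

def residual_std(ratio):
    # Std. deviation of residuals for a given blend ratio
    predicted = (np.log(hours)*h1 + h2)*ratio + (np.log(age)*a1 + a2)*(1 - ratio)
    return np.std(prices - predicted)

res = minimize_scalar(residual_std, bounds=(0.0, 1.0), method='bounded')
optimum_ratio = res.x
```

This searches the continuous interval [0, 1] instead of a fixed list of candidate ratios, so no grid resolution has to be chosen up front.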

As you can see, I'm achieving this by brute force, which is not an elegant solution.

Moreover, I've now found that the relationship between the three can't be expressed by a single ratio; instead, the ratio needs to slide. As the age increases, the hours/age ratio decreases, so age has an ever greater influence on the market price.

Unfortunately, I can't achieve this with SciPy's curve_fit, since it only accepts a single pair of arrays.

Any thoughts on how best to achieve this?

2 answers:

Answer 0 (score: 2)

Arrays can be created with more than one dimension, in which case you can pass both the hours and age data into curve_fit. An example of this might be:

import numpy as np
from scipy.optimize import curve_fit

hours =  [1000,  10000,  11000,  11000,  15000,  18000,  37000,  24000,
          28000,  28000,  42000,  46000,  50000,  34000,  34000,  46000,
          50000,  56000,  64000,  64000,  65000,  80000,  81000,  81000,
          44000,  49000,  76000,  76000,  89000,  38000,  80000,  69000,
          46000,  47000,  57000,  72000,  77000,  68000]

market_Price =  [30945,  28974,  27989,  27989,  36008,  24780,  22980,
                 23997,  25957,  27847,  36000,  25588,  23980,  25990,  
                 25990,  28995,  26770,  26488,  24988,  24988,  17574,
                 12995,  19788,  20488,  19980,  24978,  16000,  16400,
                 18988,  19980,  18488,  16988,  15000,  15000,  16998,
                 17499,  15780,  8400]

age =  [2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  4,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  7,  
        8,  8,  8,  8,  8,  13]

combined = np.array([hours, age])

def f(data, h1, h2, a1, a2):
    # Some function which uses data, where
    # data[0] = hours and data[1] = age
    return (np.log(data[0])*h1 + h2) + (np.log(data[1])*a1 + a2)

popt, pcov = curve_fit(f, combined, market_Price)

Answer 1 (score: 2)

This is a multivariate regression problem; you don't need to write your own code, because it already exists:

http://wiki.scipy.org/Cookbook/OLS

Note: in the end you do not have 5 parameters (h1, h2, a1, a2, ratio). You only have three: h2*ratio + a2*(1-ratio), h1*ratio, and a1*(1-ratio).
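If the Cookbook ols helper isn't available, the same three-parameter fit can be done with plain NumPy least squares (a sketch using only the first few data points; the design matrix columns are intercept, log(hours), log(age)):

```python
import numpy as np

# First few data points from the question, for brevity
hours = np.array([1000, 10000, 11000, 11000, 15000, 18000, 37000, 24000])
age = np.array([2, 2, 2, 2, 2, 2, 2, 3])
market_price = np.array([30945, 28974, 27989, 27989, 36008, 24780, 22980, 23997])

# Design matrix: intercept, log(hours), log(age)
X = np.column_stack([np.ones(len(hours)), np.log(hours), np.log(age)])
coef, residuals, rank, sv = np.linalg.lstsq(X, market_price, rcond=None)
# coef[0] ~ h2*ratio + a2*(1-ratio), coef[1] ~ h1*ratio, coef[2] ~ a1*(1-ratio)
```

The three entries of coef correspond to the three identifiable parameters noted above; the individual h1, h2, a1, a2, ratio values cannot be recovered from them separately.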

In [26]:

y=np.array(market_Price)
x=np.log(np.array([hours, age])).T
In [27]:

mymodel = ols(y, x, 'Market_Price', ['Hours', 'Age'])  # ols class from the Cookbook recipe linked above
In [28]:

mymodel.p # return coefficient p-values
Out[28]:
array([  1.32065700e-05,   3.06318351e-01,   1.34081122e-05])
In [29]:

mymodel.summary()

==============================================================================
Dependent Variable: Market_Price
Method: Least Squares
Date:  Mon, 24 Mar 2014
Time:  15:40:00
# obs:                  38
# variables:         3
==============================================================================
variable     coefficient     std. Error      t-statistic     prob.
==============================================================================
const           45838.261850      9051.125823      5.064371      0.000013
Hours          -1023.097422      985.498239     -1.038152      0.306318
Age            -8862.186475      1751.640834     -5.059363      0.000013
==============================================================================
Models stats                         Residual stats
==============================================================================
R-squared             0.624227        Durbin-Watson stat   1.301026
Adjusted R-squared    0.602754        Omnibus stat         2.999547
F-statistic           29.070664       Prob(Omnibus stat)   0.223181
Prob (F-statistic)    0.000000        JB stat              1.807013
Log likelihood       -366.421766      Prob(JB)             0.405146
AIC criterion         19.443251       Skew                 0.376021
BIC criterion         19.572534       Kurtosis             3.758751
==============================================================================