Code too slow when choosing the best segmented regression fit

Date: 2019-06-11 21:42:07

Tags: python numpy scipy linear-regression

I am writing a program that performs segmented (piecewise) linear regression on data containing up to 4-5 breakpoints, and then decides which number of breakpoints best avoids over- and under-fitting. However, the code is inelegant and runs extremely slowly.

My draft code is below:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit, differential_evolution
import matplotlib.pyplot as plt
import warnings


def segmentedRegression_two(xData,yData):

    def func(xVals,break1,break2,slope1,offset1,slope_mid,offset_mid,slope2,offset2):
            returnArray=[]
            for x in xVals:
                if x < break1:
                    returnArray.append(slope1 * x + offset1)
                elif (np.logical_and(x >= break1,x<break2)):
                    returnArray.append(slope_mid * x + offset_mid)
                else:
                    returnArray.append(slope2 * x + offset2)

            return returnArray

    def sumSquaredError(parametersTuple): #Definition of an error function to minimize
        model_y=func(xData,*parametersTuple)
        warnings.filterwarnings("ignore") # Ignore warnings by genetic algorithm

        return np.sum((yData-model_y)**2.0)

    def generate_genetic_Parameters():
            initial_parameters=[]
            x_max=np.max(xData)
            x_min=np.min(xData)
            y_max=np.max(yData)
            y_min=np.min(yData)
            slope=10*(y_max-y_min)/(x_max-x_min)

            initial_parameters.append([x_max,x_min]) #Bounds for model break point
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([-slope,slope]) 
            initial_parameters.append([-y_max,y_min]) 
            initial_parameters.append([-slope,slope]) 
            initial_parameters.append([-y_max,y_min]) 
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min]) 



            result=differential_evolution(sumSquaredError,initial_parameters,seed=3)

            return result.x

    geneticParameters = generate_genetic_Parameters() #Generates genetic parameters



    fittedParameters, pcov= curve_fit(func, xData, yData, geneticParameters) #Fits the data 
    print('Parameters:', fittedParameters)





    model=func(xData,*fittedParameters)

    absError = model - yData

    SE = np.square(absError) 
    MSE = np.mean(SE) 
    RMSE = np.sqrt(MSE) 
    Rsquared = 1.0 - (np.var(absError) / np.var(yData))




    return Rsquared

def segmentedRegression_three(xData,yData):

    def func(xVals,break1,break2,break3,slope1,offset1,slope2,offset2,slope3,offset3,slope4,offset4):
            returnArray=[]
            for x in xVals:
                if x < break1:
                    returnArray.append(slope1 * x + offset1)
                elif (np.logical_and(x >= break1,x<break2)):
                    returnArray.append(slope2 * x + offset2)
                elif (np.logical_and(x >= break2,x<break3)):
                    returnArray.append(slope3 * x + offset3)
                else:
                    returnArray.append(slope4 * x + offset4)

            return returnArray

    def sumSquaredError(parametersTuple): #Definition of an error function to minimize
        model_y=func(xData,*parametersTuple)
        warnings.filterwarnings("ignore") # Ignore warnings by genetic algorithm

        return np.sum((yData-model_y)**2.0)

    def generate_genetic_Parameters():
            initial_parameters=[]
            x_max=np.max(xData)
            x_min=np.min(xData)
            y_max=np.max(yData)
            y_min=np.min(yData)
            slope=10*(y_max-y_min)/(x_max-x_min)

            initial_parameters.append([x_max,x_min]) #Bounds for model break point
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([-slope,slope]) 
            initial_parameters.append([-y_max,y_min]) 
            initial_parameters.append([-slope,slope]) 
            initial_parameters.append([-y_max,y_min]) 
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min]) 
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min]) 



            result=differential_evolution(sumSquaredError,initial_parameters,seed=3)

            return result.x

    geneticParameters = generate_genetic_Parameters() #Generates genetic parameters



    fittedParameters, pcov= curve_fit(func, xData, yData, geneticParameters) #Fits the data 
    print('Parameters:', fittedParameters)





    model=func(xData,*fittedParameters)

    absError = model - yData

    SE = np.square(absError) 
    MSE = np.mean(SE) 
    RMSE = np.sqrt(MSE) 
    Rsquared = 1.0 - (np.var(absError) / np.var(yData))


    return Rsquared

def segmentedRegression_four(xData,yData):

    def func(xVals,break1,break2,break3,break4,slope1,offset1,slope2,offset2,slope3,offset3,slope4,offset4,slope5,offset5):
            returnArray=[]
            for x in xVals:
                if x < break1:
                    returnArray.append(slope1 * x + offset1)
                elif (np.logical_and(x >= break1,x<break2)):
                    returnArray.append(slope2 * x + offset2)
                elif (np.logical_and(x >= break2,x<break3)):
                    returnArray.append(slope3 * x + offset3)
                elif (np.logical_and(x >= break3,x<break4)):
                    returnArray.append(slope4 * x + offset4)
                else:
                    returnArray.append(slope5 * x + offset5)

            return returnArray

    def sumSquaredError(parametersTuple): #Definition of an error function to minimize
        model_y=func(xData,*parametersTuple)
        warnings.filterwarnings("ignore") # Ignore warnings by genetic algorithm

        return np.sum((yData-model_y)**2.0)

    def generate_genetic_Parameters():
            initial_parameters=[]
            x_max=np.max(xData)
            x_min=np.min(xData)
            y_max=np.max(yData)
            y_min=np.min(yData)
            slope=10*(y_max-y_min)/(x_max-x_min)

            initial_parameters.append([x_max,x_min]) #Bounds for model break point
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([x_max,x_min])
            initial_parameters.append([-slope,slope])
            initial_parameters.append([-y_max,y_min])
            initial_parameters.append([-slope,slope])
            initial_parameters.append([-y_max,y_min])
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min])
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min])
            initial_parameters.append([-slope,slope])
            initial_parameters.append([y_max,y_min])

            result=differential_evolution(sumSquaredError,initial_parameters,seed=3)

            return result.x

    geneticParameters = generate_genetic_Parameters() #Generates genetic parameters

    fittedParameters, pcov= curve_fit(func, xData, yData, geneticParameters) #Fits the data
    print('Parameters:', fittedParameters)

    model=func(xData,*fittedParameters)

    absError = model - yData

    SE = np.square(absError)
    MSE = np.mean(SE)
    RMSE = np.sqrt(MSE)
    Rsquared = 1.0 - (np.var(absError) / np.var(yData))

    return Rsquared

From here, what I have had in mind so far is something like this:

r2s=[segmentedRegression_two(xData,yData),segmentedRegression_three(xData,yData),segmentedRegression_four(xData,yData)]

best_fit=np.max(r2s)

Although I may need to use something like AIC instead.
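For reference, a minimal sketch of what an AIC-based comparison could look like, assuming each fit function were extended to also return its sum of squared errors and parameter count (the helper below uses the standard least-squares form of AIC and is illustrative, not part of the draft above):

def aic(sse, n_points, n_params):
    # Least-squares AIC: n * ln(SSE / n) + 2k; lower is better.
    return n_points * np.log(sse / n_points) + 2 * n_params

# Hypothetical usage, if each fit returned a (sse, n_params) pair:
# aics = [aic(sse, len(xData), k) for sse, k in fits]
# best = int(np.argmin(aics))  # index of the model to prefer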

Is there any way to make this run more efficiently?

1 Answer:

Answer 0 (score: 1):

I grabbed one of your func definitions and put it in a test script:

import numpy as np

def func(xVals,break1,break2,break3,slope1,offset1,slope2,offset2,slope3,offset3,slope4,offset4):
    returnArray=[]
    for x in xVals:
        if x < break1:
            returnArray.append(slope1 * x + offset1)
        elif (np.logical_and(x >= break1,x<break2)):
            returnArray.append(slope2 * x + offset2)
        elif (np.logical_and(x >= break2,x<break3)):
            returnArray.append(slope3 * x + offset3)
        else:
            returnArray.append(slope4 * x + offset4)

    return returnArray

arr = np.linspace(0,20,10000)
breaks = [4, 10, 15]
slopes = [.1, .2, .3, .4]
offsets = [1,2,3,4]
sl_off = np.array([slopes,offsets]).T.ravel().tolist()
print(sl_off)
ret = func(arr, *breaks, *sl_off)
if len(ret)<25:
    print(ret)

Then I took a first step toward "vectorizing" it, evaluating the function on blocks of values rather than element by element:

def func1(xVals, breaks, slopes, offsets):
    res = np.zeros(xVals.shape)
    i = 0 
    mask = xVals<breaks[i]
    res[mask] = slopes[i]*xVals[mask]+offsets[i]
    for i in [1,2]:
        mask = np.logical_and(xVals>=breaks[i-1], xVals<breaks[i])
        res[mask] = slopes[i]*xVals[mask]+offsets[i]
    i=3
    mask = xVals>=breaks[i-1]
    res[mask] = slopes[i]*xVals[mask]+offsets[i]
    return res

ret1 = func1(arr, breaks, slopes, offsets)
print(np.allclose(ret, ret1))

The allclose test prints True. I also ran both versions in ipython and timed them:

In [41]: timeit func(arr, *breaks, *sl_off)                                                            
66.2 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: timeit func1(arr, breaks, slopes, offsets)                                                    
165 µs ± 586 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I also did plt.plot(xVals, ret) to see a simple plot of the function.

I wrote func1 with the aim of making it work for all three of your cases. It isn't there yet, but it shouldn't be hard to adapt it based on the lengths of the input lists (or arrays).
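For instance, here is a sketch of one way it might be generalized (my own variant, assuming len(slopes) == len(offsets) == len(breaks) + 1):

def func_general(xVals, breaks, slopes, offsets):
    # Pad the breakpoints with -inf/+inf so every segment becomes a
    # half-open interval [edges[i], edges[i+1]).
    edges = np.concatenate(([-np.inf], breaks, [np.inf]))
    res = np.zeros(xVals.shape)
    for i in range(len(slopes)):
        mask = np.logical_and(xVals >= edges[i], xVals < edges[i + 1])
        res[mask] = slopes[i] * xVals[mask] + offsets[i]
    return res

print(np.allclose(ret1, func_general(arr, breaks, slopes, offsets)))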

I'm sure more could be done, but this should be a start in the right direction.
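One such further step (again a sketch of mine, not timed above) is to drop the masking loop entirely: np.searchsorted can assign every point to its segment in a single vectorized call:

def func2(xVals, breaks, slopes, offsets):
    # For each x, find which segment it falls in: 0 for x < breaks[0],
    # len(breaks) for x >= breaks[-1].  side='right' puts x == break
    # in the upper segment, matching the >= comparisons in func1.
    seg = np.searchsorted(np.asarray(breaks), xVals, side='right')
    return np.asarray(slopes)[seg] * xVals + np.asarray(offsets)[seg]

print(np.allclose(ret1, func2(arr, breaks, slopes, offsets)))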

There is also a numpy piecewise evaluator:

np.piecewise(x, condlist, funclist, *args, **kw)

but it looks to me like constructing its two input lists would be just as much work.
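For the three-break case, reusing arr, breaks, slopes and offsets from the test script above, those two lists might look like this (a sketch; I haven't benchmarked it):

condlist = [arr < breaks[0],
            np.logical_and(arr >= breaks[0], arr < breaks[1]),
            np.logical_and(arr >= breaks[1], arr < breaks[2]),
            arr >= breaks[2]]
# Default arguments bind each slope/offset pair at definition time.
funclist = [lambda x, s=s, o=o: s * x + o for s, o in zip(slopes, offsets)]
ret2 = np.piecewise(arr, condlist, funclist)
print(np.allclose(ret, ret2))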