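I have a dataset for a regression problem. I initially treated it as a linear regression problem, but when I plotted "date_time" against "traffic_volume", the result looked like a sine curve, so I decided to try curve fitting. Here is the code: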
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import leastsq
#import matplotlib.pyplot as plt
import pylab as plt
from scipy.optimize import curve_fit
df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
df['holiday'].replace(to_replace = 'None', value = '0', inplace=True)
df.loc[df['holiday'] != '0', 'holiday'] = 1
print(df.shape)
df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%Y %H:%M')
df['date_time'] = (df['date_time']- dt.datetime(1970,1,1)).dt.total_seconds()
#print(df['date_time'].head())
non_dummy_cols = ['holiday','temp','rain_1h', 'snow_1h', 'clouds_all','date_time', 'traffic_volume']
dummy_cols = list(set(df.columns) - set(non_dummy_cols))
df = pd.get_dummies(df, columns=dummy_cols)
print(df.shape)
x = df[df.columns.values]
x = x.drop(['traffic_volume'], axis=1)
x = x.drop(['clouds_all'], axis = 1)
y = df['traffic_volume']
print(x.shape)
print(y.shape)
#plt.figure(figsize=(6,4))
#plt.scatter(df.date_time[0:100], df.traffic_volume[0:100], color = 'blue')
#plt.xlabel("Date Time")
#plt.ylabel("Traffic volume")
#plt.show()
x = StandardScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state= 4)
def my_sin(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset
#x_train = np.array(x_train)
#y_train = np.array(y_train)
print(x_train)
popt, pcov = curve_fit(my_sin, x_train, y_train)
y_hat = my_sin(x_test, *popt)
The problem with this approach is the following error:
ValueError: operands could not be broadcast together with shapes (38563,54) (38563,)
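I understand that the error comes from x_train.shape: it is (m, n), whereas curve_fit with this model function expects an array of shape (m,). When I trained on just one feature instead of all 53 columns of x_train, curve_fit ran, but the resulting model was terrible. Here is the dataset link:
Dataset: Download
For a quick look, here is an image of the first few rows of the dataset: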
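So please help me train this model. Can you suggest an algorithm that would work for it? Should I drop some of these features or use all of them? I also tried fitting the data with degree-2 polynomial regression, and when I used degree 3 my PC crashed several times.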
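The degree-3 crashes are likely a memory issue rather than a bug. As a rough sketch: a full polynomial expansion of n input columns up to degree d (which is what PolynomialFeatures produces with its default settings, bias column included) has C(n + d, d) output columns. The helper names below are hypothetical, the row and column counts are taken from the shapes in the error message above, and 8-byte floats in a dense array are assumed.

```python
from math import comb

def poly_feature_count(n_features, degree):
    # Number of monomials of total degree <= `degree` in `n_features`
    # variables, including the constant (bias) term.
    return comb(n_features + degree, degree)

def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    # Size of a dense float64 matrix in gigabytes (1 GB = 1e9 bytes).
    return n_rows * n_cols * bytes_per_value / 1e9

n_rows, n_features = 38563, 54  # shapes reported in the ValueError
for degree in (2, 3):
    n_cols = poly_feature_count(n_features, degree)
    size = dense_size_gb(n_rows, n_cols)
    print(f"degree {degree}: {n_cols} columns, about {size:.2f} GB")
```

Under these assumptions the degree-2 expansion of the training set stays under half a gigabyte, while degree 3 needs roughly 9 GB for the expanded training matrix alone, which would explain the crashes on a typical desktop.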
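Note: I am re-asking this question on the advice of a community member, because my previous question dealt only with the curve_fit error, and that was not clear from its title.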
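One possible reason the single-feature fit turned out so poorly: when no p0 is passed, curve_fit starts every parameter at 1, and a starting frequency of 1 may be far from the real period once date_time has been converted to seconds and standardized. The sketch below is only an illustration on synthetic data, not the real dataset. It assumes a daily cycle, a time axis measured in hours, and made-up values for amplitude, phase, offset and noise.

```python
import numpy as np
from scipy.optimize import curve_fit

def my_sin(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset

# Synthetic stand-in for two weeks of hourly counts (assumed values).
rng = np.random.default_rng(0)
hours = np.arange(24 * 14, dtype=float)
true_params = (2 * np.pi / 24, 1500.0, 0.5, 3000.0)
volume = my_sin(hours, *true_params) + rng.normal(0, 100, hours.size)

# Starting guesses: one cycle per 24 hours, amplitude estimated from
# the spread of the data, zero phase, offset from the mean.
p0 = [2 * np.pi / 24, np.sqrt(2) * volume.std(), 0.0, volume.mean()]
popt, pcov = curve_fit(my_sin, hours, volume, p0=p0)

for name, value in zip(["freq", "amplitude", "phase", "offset"], popt):
    print(f"{name}: {value:.4f}")
```

Note that xdata here is a single 1-D array, which is why the broadcasting error does not occur. Whether a lone sine wave is an adequate model for the real traffic data is a separate question that this sketch does not answer.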