Question

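I have separate training and test datasets containing brain and body weight information. After learning from the training dataset, I want to predict the brain weight in the test dataset from the body weight given in the test dataset. I have already tried linear regression, but it did not give acceptable results because the data are unevenly distributed.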

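How can I use scikit-learn to fit a model on the training dataset and then predict from a single-column test dataset? The arrays below are for demonstration only.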

    Training['Brain'] = [3.385, .480, 1.350, 465.00,36.330, 27.660, 14.830, 1.040, 4.190, 0.425, 0.101, 0.920, 1.000, 0.005, 0.060, 3.500 ]

    Training['Body'] = [44.500, 15.5, 8.1, 423, 119.5, 115, 98.2, 5.5,58, 6.40, 4, 5.7,6.6, 140,1, 10.8] 

    Test['Brain'] = [192.000,3.000,160.000,0.900,1.620,0.104,4.235]
    Test['Body'] = [180.000,25.000,169.000,2.600,11.400,2.500,50.400]




    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    training = pd.read_csv('C:\\training.csv', index_col='Index')
    test = pd.read_csv('C:\\test.csv', index_col='Index')

    train_x = training['Brain']
    train_y = training['Body']

    # Fit a simple linear regression of Body on Brain.
    slope, intercept, r_value, p_value, std_err = stats.linregress(train_x, train_y)

    # Plot the training points and the fitted line.
    fig, ax = plt.subplots(figsize=(20, 10))
    plt.axis([-10, 600, -10, 700])
    plt.plot(train_x, train_y, 'o', color='blue')
    plt.ylabel('Body')
    plt.xlabel('Brain')
    plt.plot(train_x, train_x * slope + intercept, 'black')
    plt.show()

    # Apply the fitted line to the Body column of the test set.
    newX = test['Body']
    newY = newX * slope + intercept

    print(newX)
    print(newY)
    print(std_err)

Answer 1

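Since you asked, here is my view: I would suggest not throwing an arbitrary algorithm at the data. You need to choose an algorithm that suits the data to get good results. That said, here is a linear regression example; similar predictions can be made with other algorithms. All inputs are reshaped into 2-D column arrays.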

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Reshape each column into a 2-D array of shape (n_samples, 1).
    Test_x = np.array(Test['Brain']).reshape(-1, 1)
    Test_y = np.array(Test['Body']).reshape(-1, 1)
    Train_x = np.array(Training['Brain']).reshape(-1, 1)
    Train_y = np.array(Training['Body']).reshape(-1, 1)

    LinReg = LinearRegression()
    LinReg.fit(Train_x, Train_y)
    LinReg.predict(Test_x)
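The question asks for brain weight predicted from body weight, so a minimal sketch in that direction might look like the following. It uses the demonstration arrays from the question; the variable names (`training`, `test`, `model`) and the use of mean absolute error as a rough check are assumptions made here for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Demonstration data copied from the question.
training = pd.DataFrame({
    "Brain": [3.385, 0.480, 1.350, 465.00, 36.330, 27.660, 14.830, 1.040,
              4.190, 0.425, 0.101, 0.920, 1.000, 0.005, 0.060, 3.500],
    "Body": [44.500, 15.5, 8.1, 423, 119.5, 115, 98.2, 5.5,
             58, 6.40, 4, 5.7, 6.6, 140, 1, 10.8],
})
test = pd.DataFrame({
    "Brain": [192.000, 3.000, 160.000, 0.900, 1.620, 0.104, 4.235],
    "Body": [180.000, 25.000, 169.000, 2.600, 11.400, 2.500, 50.400],
})

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features);
# selecting with a list of column names keeps the DataFrame 2-D.
X_train = training[["Body"]]
y_train = training["Brain"]
X_test = test[["Body"]]

model = LinearRegression()
model.fit(X_train, y_train)

# One predicted brain weight per test row.
predicted_brain = model.predict(X_test)

# Compare against the known test values (assumed metric, for illustration).
mae = mean_absolute_error(test["Brain"], predicted_brain)
print(predicted_brain)
print("Mean absolute error:", mae)
```

With data this skewed, a single straight line is likely to fit poorly, which matches the problem described in the question.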

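Also, based on your comment: yes, try removing the outliers from the dataset, and then you can fit a polynomial curve. I have attached the plots with and without the outliers; you can see the non-linear trend.

Figure 1: with outliers
Figure 2: without outliers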
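The outlier removal and polynomial fit described above could be sketched roughly as follows. The answer does not say how the outliers were identified or which polynomial degree was used, so the 1.5 × IQR rule, the degree of 2, and the names `drop_iqr_outliers`, `trimmed`, and `poly_model` are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Demonstration training data copied from the question.
training = pd.DataFrame({
    "Brain": [3.385, 0.480, 1.350, 465.00, 36.330, 27.660, 14.830, 1.040,
              4.190, 0.425, 0.101, 0.920, 1.000, 0.005, 0.060, 3.500],
    "Body": [44.500, 15.5, 8.1, 423, 119.5, 115, 98.2, 5.5,
             58, 6.40, 4, 5.7, 6.6, 140, 1, 10.8],
})


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within k * IQR of the quartiles.

    This is an assumed outlier rule; the original answer does not name one.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


trimmed = drop_iqr_outliers(training, ["Brain", "Body"])

# Degree-2 polynomial regression of Brain on Body (degree is an assumption).
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(trimmed[["Body"]], trimmed["Brain"])

# Predict brain weight for two body weights taken from the test set.
new_body = pd.DataFrame({"Body": [25.0, 50.4]})
predictions = poly_model.predict(new_body)
print(predictions)
```

With so few rows left after trimming, a higher-degree curve can easily overfit, so the degree is worth checking against the held-out test set.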
