Question

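Goal: predict the price per square foot of 4 houses from 2 features (Feature1 and Feature2). I have 7 houses with Feature1, Feature2, and the price per square foot. The last 4 houses have only Feature1 and Feature2. I know what the values should be, and when I compare them with the predicted values they are completely different.

My code: I have a CSV file that I read into a pandas DataFrame, then I use LinearRegression to train and test a model on it.

Data: this is a snapshot of the data I am using. I need to predict the last 4 "Pricepersqrft" values.

Problem: I cannot get an accuracy above 10%, which means I am not getting the correct "Pricepersqrft" for the last 4 houses.

Here is my code: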

import numpy as np
import pandas as pd
import scipy 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn import datasets

csvfileData = THE DATA SHOWN IN THE SNAPSHOT
dataRead = pd.read_csv(csvfileData)
dfCreated = pd.DataFrame(dataRead) #creating a pandas dataframe
print(dfCreated)
# print(dfCreated.head()) #shows first 5 rows of data frame

dfCreated.drop(dfCreated.columns[[0]], axis=1, inplace = True)
print(dfCreated)

# where_are_NaNs = numpy.isnan(dfCreated) #previous line displayed Nan where no value was present for "Pricepersqrft column"
# dfCreated[where_are_NaNs] = 0 #use numpy's isnan and set all Nan to 0
# print(dfCreated)
dfCreated.hist(bins = 10, figsize=(20,15)) #plotting histograms using matplotlib
plt.show()

#creating scatter plots 
dfCreated.plot(kind="scatter", x= "Feature1", y="Feature2", alpha=0.5)
correlationMatrix = dfCreated.corr() #computes correlation between 2 columns 
print(correlationMatrix["Feature1"].sort_values(ascending=False))

#value that needs to be predicted
Y= dfCreated['Pricepersqrft']
print(Y)  

#training the model and testing, train_test_split expects both dataframes to be of same length
X_train, X_test, Y_train, Y_test = train_test_split(dfCreated, Y, test_size=0.20, random_state=0)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

reg = LinearRegression()
reg.fit(X_train, Y_train)
#predictions = reg.predict(X_test)
#print(predictions)
reg.score(X_test, Y_test)

The last four "Pricepersqrft" values are 105.22, 142.68, 132.94, and 129.71.

Answer 1

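pd.read_csv already returns a pandas DataFrame, so there is no need to wrap the result in pd.DataFrame.
You are splitting the whole dataset at random for testing; how can you be sure that the last observations end up as the test data?
Use all the observations you want to predict as the test data and the remaining observations as the training data. Also, if the data shown here is all you have, the predictions may not be very good because there are so few observations.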
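
As a rough illustration of the two points above, here is a minimal sketch. The CSV contents are made up (the real file is not shown), and it assumes the rows to predict are exactly the ones where "Pricepersqrft" is missing, which matches the NaN values mentioned in the question's comments.

```python
import io
import pandas as pd

# Hypothetical CSV contents; the real file from the question is not available.
csv_text = """Feature1,Feature2,Pricepersqrft
10,20,130
12,18,128
15,25,155
12,21,
15,23,
"""

# read_csv already returns a DataFrame, so no pd.DataFrame(...) wrapper is needed.
data = pd.read_csv(io.StringIO(csv_text))
print(type(data))

# Rows with a known price become training data; rows without one are the rows to predict.
has_price = data["Pricepersqrft"].notna()
train_data = data[has_price]
test_data = data[~has_price]
print(len(train_data), len(test_data))
```

Splitting on the missing target like this avoids relying on a random split to pick the right rows.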

Use iloc for integer-position-based indexing of the n rows.

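train_data = data.iloc[0:m]  # first m rows: the observations with a known price
test_data = data.iloc[m:n]   # rows m..n-1: the observations to predict (the end of an iloc slice is exclusive)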
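
Putting the iloc split together with the model, a sketch could look like the following. The numbers are invented for illustration (they are not the data from the snapshot), and it assumes 7 labelled rows followed by 4 unlabelled ones. Note that the model is fitted on the two feature columns only, not on the whole DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 7 rows with a known price, then 4 rows whose price is unknown.
data = pd.DataFrame({
    "Feature1": [10, 12, 15, 11, 14, 13, 16, 12, 15, 14, 13],
    "Feature2": [20, 18, 25, 22, 19, 24, 21, 21, 23, 20, 22],
    "Pricepersqrft": [130, 128, 155, 138, 135, 148, 145] + [np.nan] * 4,
})

m = 7            # number of labelled rows (assumption for this sketch)
n = len(data)    # total number of rows

train_data = data.iloc[0:m]   # labelled rows
test_data = data.iloc[m:n]    # rows to predict

features = ["Feature1", "Feature2"]

# Fit on the features only; the target column must not be part of X.
reg = LinearRegression()
reg.fit(train_data[features], train_data["Pricepersqrft"])

# Predict the missing prices for the last rows.
predictions = reg.predict(test_data[features])
print(predictions)
```

Also keep in mind that reg.score returns the coefficient of determination (R^2), not an accuracy percentage, so a value of 0.10 does not mean that 10% of the predictions are correct. To judge the last 4 houses, compare the predictions with the known values directly.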

My linear regression model shows a score of 10%. How can I improve it?
