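Goal: predict the price per square foot of 4 houses from their 2 features (Feature1 and Feature2). I have 7 houses with Feature1, Feature2 and a price per square foot. The last 4 houses only have "Feature1" and "Feature2". I know what values should be there. When I compare them with the [predicted values], they are completely different.
My code: I have a CSV file, which I read and convert into a pandas DataFrame, and then I use LinearRegression to train and test a model on it.
Data: this is a snapshot of my data. This is the data I am using, and I need to predict the last 4 "Pricepersqrft" values.
Problem: I can't get an accuracy above 10%, which means I am not getting the correct "Pricepersqrft" for the last 4 houses.
Here is my code: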
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn import datasets
csvfileData = THE DATA SHOWN IN THE SNAPSHOT
dataRead = pd.read_csv(csvfileData)
dfCreated = pd.DataFrame(dataRead) #creating a pandas dataframe
print(dfCreated)
# print(dfCreated.head()) #shows first 5 rows of data frame
dfCreated.drop(dfCreated.columns[[0]], axis=1, inplace = True)
print(dfCreated)
# where_are_NaNs = numpy.isnan(dfCreated) #previous line displayed Nan where no value was present for "Pricepersqrft column"
# dfCreated[where_are_NaNs] = 0 #use numpy's isnan and set all Nan to 0
# print(dfCreated)
dfCreated.hist(bins = 10, figsize=(20,15)) #plotting histograms using matplotlib
plt.show()
#creating scatter plots
dfCreated.plot(kind="scatter", x= "Feature1", y="Feature2", alpha=0.5)
correlationMatrix = dfCreated.corr() #computes correlation between 2 columns
print(correlationMatrix["Feature1"].sort_values(ascending=False))
#value that needs to be predicted
Y= dfCreated['Pricepersqrft']
print(Y)
#training the model and testing, train_test_split expects both dataframes to be of same length
X_train, X_test, Y_train, Y_test = train_test_split(dfCreated, Y, test_size=0.20, random_state=0)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
reg = LinearRegression()
reg.fit(X_train, Y_train)
#predictions = reg.predict(X_test)
#print(predictions)
reg.score(X_test, Y_test)
The last four "Pricepersqrft" values should be 105.22, 142.68, 132.94 and 129.71.
Answer 0 (score: 2)
pd.read_csv already returns a pandas DataFrame, so there is no need to wrap its result in pd.DataFrame.
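A quick way to check this, as a minimal sketch: the CSV text below is made up for illustration. The column names come from the question, but the values are placeholders and not the asker's data.

```python
import io

import pandas as pd

# Hypothetical CSV content: placeholder values, not the data from the question.
csv_text = "Feature1,Feature2,Pricepersqrft\n1.0,2.0,100.0\n2.0,3.0,110.0\n"

# read_csv accepts a path or any file-like object and returns a DataFrame directly.
data = pd.read_csv(io.StringIO(csv_text))

is_dataframe = isinstance(data, pd.DataFrame)
print(type(data).__name__, data.shape)
```

Since the result is already a DataFrame, the extra pd.DataFrame(dataRead) call in the question only makes another DataFrame from the same data.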
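You are splitting the whole dataset at random into train and test sets, so how can you be sure that the last observations end up as the test data?
Use all the observations you want to predict as the test data and the remaining observations as the training data. Also, if the data shown here is all you have, the predictions may not be very good, because there are so few observations.
Use iloc for integer-position-based indexing of the rows: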
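# m: number of rows with a known price; n: total number of rows
train_data = data.iloc[0:m]
# iloc slices exclude the stop index, so m:n already covers the last row
test_data = data.iloc[m:n]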
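Putting the pieces together, here is a rough sketch of the whole workflow. The CSV values are hypothetical stand-ins (an exactly linear toy relationship, not the asker's data), and it assumes the rows with a known price come first and the rows to predict come last with an empty "Pricepersqrft" field.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the CSV file: 7 rows with a price, 4 rows without.
# The prices follow 10*Feature1 + 5*Feature2 + 50 purely for illustration.
csv_text = """Feature1,Feature2,Pricepersqrft
1,2,70
2,1,75
3,4,100
4,3,105
5,6,130
6,5,135
7,8,160
8,7,
9,10,
10,9,
11,12,
"""
data = pd.read_csv(io.StringIO(csv_text))

features = ["Feature1", "Feature2"]
target = "Pricepersqrft"

# Assumption: all rows with a known price precede the rows to predict,
# so counting the non-missing prices gives the split point m.
m = int(data[target].notna().sum())

train_data = data.iloc[:m]  # rows 0..m-1, price known
test_data = data.iloc[m:]   # remaining rows, price to be predicted

# Fit on the two feature columns only; the target must not be part of X.
reg = LinearRegression()
reg.fit(train_data[features], train_data[target])

predictions = reg.predict(test_data[features])
print(predictions)
```

Note that this fits on Feature1 and Feature2 only, whereas the question's code passes the whole DataFrame, including "Pricepersqrft", as X. With real data and only 7 training rows, the predictions can still be far from the expected values.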