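I am trying to use linear regression to predict a movie's averageRating from three columns, my "X" (genre1, genre2 and genre3). I dropped the remaining columns, and my "y" is the averageRating column. I have converted the genres into numbers that represent them. This is my first machine learning project, and I am not sure whether linear regression is suitable for this case.
I tried building a list with the number of occurrences of each genre across (genre1, genre2, genre3), for example "Comedy: 6565", and using that list as my X. I also tried one-hot encoding, but creating a column for every category by hand would be tedious.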
import pandas as pd
from collections import Counter
from sklearn.linear_model import LinearRegression
reg = LinearRegression() #Instance
list1 = [[105180,1971.0,"Sono un marito infedele","Êtes-vous fiancée à un marin grec ou à un pilote de ligne?",0,96.0,"Comedy","","","IT",0.0,42.0,5.3]]
list2 = [[34325,1942.0,"Que viene el coco","The Boogie Man Will Get You",0,66.0,"Comedy","Horror","","ES",0.0,682.0]]
train = pd.DataFrame(list1,columns=["id","startYear","title","originalTitle","isAdult","runtimeMinutes","genre1","genre2","genre3","region","isOriginalTitle","numVotes","averageRating"])#trainfile
test = pd.DataFrame(list2,columns=["id","startYear","title","originalTitle","isAdult","runtimeMinutes","genre1","genre2","genre3","region","isOriginalTitle","numVotes"])
#Test file
train["genre1"].fillna(-1, inplace=True)#replacing NaN
train["genre2"].fillna(-1, inplace=True)#replacing NaN
train["genre3"].fillna(-1, inplace=True)#replacing NaN
l1 = list(train["genre1"].unique())
l2 = list(train["genre2"].unique())
l3 = list(train["genre3"].unique())
genres =list(set(l1)|set(l2)|set(l3))
col = ["genre1","genre2","genre3"]
di = {}  # dict with {0: "Comedy", 1: "Action", ...}
for f in range(len(genres)):
    di[f] = genres[f]
for colum in col:
    for gen in range(len(genres)):
        train[train[colum] == genres[gen]] = gen
X = train.drop(["genre3","genre2","id","startYear","title","originalTitle","isAdult","runtimeMinutes","region","isOriginalTitle","numVotes","averageRating"],axis=1)#Columns that I've dropped
y = train.averageRating
reg.fit(X,y) #trying to fit
y_pred=reg.predict(test)
submission = pd.DataFrame()
submission["id"] = test["id"]
submission["averageRating"] = y_pred
submission.to_csv("submission.csv", index = None)#creating an submission in csv
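I expected the result to be a new file named "submission.csv", but instead I get the error "ValueError: could not convert string to float: original_Title_of_a_movie_in_this_place" (the value comes from "originalTitle" in test).
How can I fix this so that only the genres from test are used, without reading all the columns? Should I use drop on the test columns?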
Answer 0 (score: 0)
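Something is missing in your code. X contains only one value, and so does y. Next, test is a [1, 12] matrix, and it still contains the strings from the title columns. Passing test to predict as it is makes scikit-learn try to convert those strings to floats, which causes the error.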
In [18]: print X
   genre1
0       1
In [19]: print y
0    1.0
Name: averageRating, dtype: float64
In [20]: print test
      id  startYear              title  ... region isOriginalTitle  numVotes
0  34325     1942.0  Que viene el coco  ...     ES             0.0     682.0
[1 rows x 12 columns]
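One possible way to avoid the error is to build the feature matrix for test with exactly the same columns as the one used for fitting, so that no title strings ever reach predict. The sketch below is only an illustration under stated assumptions: the helper names collect_genres and genre_indicators are hypothetical, the training rows are made-up toy data (only the first rating, 5.3, and the test row come from the question), and the genres are encoded as 0/1 indicator columns instead of the integer codes used in the question. The indicator columns are generated in a loop, so there is no need to create one column per genre by hand.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

GENRE_COLS = ["genre1", "genre2", "genre3"]  # column names from the question


def collect_genres(df):
    """Sorted genre names seen in any genre column (hypothetical helper)."""
    values = set(df[GENRE_COLS].to_numpy().ravel())
    # Skip NaN and empty strings, which both mean "no genre here".
    return sorted(g for g in values if isinstance(g, str) and g)


def genre_indicators(df, genres):
    """One 0/1 column per genre, one row per movie (hypothetical helper)."""
    rows = []
    for _, row in df[GENRE_COLS].iterrows():
        present = set(row)
        rows.append({g: int(g in present) for g in genres})
    return pd.DataFrame(rows, index=df.index, columns=genres)


# Toy training data, invented for illustration only.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["a", "b", "c", "d"],  # string column that must not reach the model
    "genre1": ["Comedy", "Comedy", "Horror", "Drama"],
    "genre2": ["", "Horror", "", "Comedy"],
    "genre3": ["", "", "", ""],
    "averageRating": [5.3, 6.0, 4.0, 7.0],
})

# Test row with the id, title and genres from the question.
test = pd.DataFrame({
    "id": [34325],
    "title": ["Que viene el coco"],
    "genre1": ["Comedy"],
    "genre2": ["Horror"],
    "genre3": [""],
})

# Learn the genre vocabulary from train only, then encode both frames with it
# so that X_train and X_test share the same columns in the same order.
genres = collect_genres(train)
X_train = genre_indicators(train, genres)
X_test = genre_indicators(test, genres)

reg = LinearRegression()
reg.fit(X_train, train["averageRating"])
y_pred = reg.predict(X_test)  # only numeric genre columns are passed

submission = pd.DataFrame({"id": test["id"], "averageRating": y_pred})
print(submission)
```

With this layout a genre that appears only in test is silently ignored, because the columns are fixed by the training data. The submission frame can then be written with to_csv as in the question. Whether a linear model on genre indicators alone predicts ratings well is a separate matter that this sketch does not address.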