要导入的数据集中包含大量NaN
值。同样我在Python中使用SurPRISE包(由Nicholas Hug编写)而不是使用Pandas。原因是预测NaN值的方法很好,所提到的包。
数据集 post_df1.csv 如下所述:
uid iid rat
1 303.0 785.0 3.000000
2 291.0 1042.0 4.000000
3 234.0 1184.0 2.000000
4 102.0 768.0 2.000000
5 181.0 1081.0 1.000000
...
194 944.0 110.0 NaN
195 944.0 111.0 NaN
196 944.0 112.0 NaN
197 944.0 113.0 NaN
198 944.0 114.0 5.000000
199 944.0 115.0 5.000000
使用SurPRISE
导入它reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1, 5))
df = Dataset.load_from_file('post_df1.csv', reader=reader)
返回错误:
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 173, in load_from_file
return DatasetAutoFolds(ratings_file=file_path, reader=reader)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 306, in __init__
self.raw_ratings = self.read_ratings(self.ratings_file)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 205, in read_ratings
itertools.islice(f, self.reader.skip_lines, None)]
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 455, in parse_line
return uid, iid, float(r) + self.offset, timestamp
ValueError: could not convert string to float:
我无法弄清楚,字符串在哪里!自从使用Pandas读取 post_df1.csv 后,返回:
post_df1.dtypes
uid float64
iid float64
rat float64
dtype: object
返回uid,iid,float(r)+ self.offset,timestamp 3.列出项目
所以,这里是 post_df1 &amp; post_df2 成立。同样对于 post_df1 ,我尝试从第1行开始获取值,以防第0行是标题。
# PRE PROCESSED CLUSTER 0 -- Named to POST DataFrame1
if flag1 is 1:
print pre_df01
post_df1 = pre_df01.iloc[1:, :]
elif flag1 is 2:
print pre_df02
post_df1 = pre_df02.iloc[1:, :]
elif flag1 is 3:
print pre_df03
post_df1 = pre_df03.iloc[1:, :]
# PRE PROCESSED CLUSTER 1 -- Named to POST DataFrame2
if flag2 is 1:
print pre_df11
post_df2 = pre_df11
elif flag2 is 2:
print pre_df12
post_df2 = pre_df12
elif flag2 is 3:
print pre_df13
post_df2 = pre_df13
在这里,我已经尝试删除标题和索引以避免其中的任何字符串类型。
# EXPORT TO CSV & LOAD AGAIN IN PROGRAM
post_df1.to_csv("post_df1.csv", sep='\t', index=False, header=False)
post_df2.to_csv("post_df2.csv", sep='\t', index=False, header=False)
答案 0 :(得分:2)
由于post_df1.csv中每列的标题(字符串格式),似乎出现此错误。当您从csv文件中删除包含列名的第一行时,您的代码片段应该正常工作。