无法使用Python中的Surprise将csv文件加载到Dataframe中

时间:2017-10-04 10:49:10

标签: python python-2.7 pandas csv dataframe

情景

要导入的数据集中包含大量NaN值。同样我在Python中使用SurPRISE包(由Nicholas Hug编写)而不是使用Pandas。原因是预测NaN值的方法很好,所提到的包。

问题

数据集 post_df1.csv 如下所述:

       uid     iid       rat
1    303.0   785.0  3.000000
2    291.0  1042.0  4.000000
3    234.0  1184.0  2.000000
4    102.0   768.0  2.000000
5    181.0  1081.0  1.000000
...
194  944.0  110.0       NaN
195  944.0  111.0       NaN
196  944.0  112.0       NaN
197  944.0  113.0       NaN
198  944.0  114.0  5.000000
199  944.0  115.0  5.000000

使用SurPRISE

导入它
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1, 5))
df = Dataset.load_from_file('post_df1.csv', reader=reader)

返回错误:

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 173, in load_from_file
    return DatasetAutoFolds(ratings_file=file_path, reader=reader)
  File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 306, in __init__
    self.raw_ratings = self.read_ratings(self.ratings_file)
  File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 205, in read_ratings
    itertools.islice(f, self.reader.skip_lines, None)]
  File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 455, in parse_line
    return uid, iid, float(r) + self.offset, timestamp
ValueError: could not convert string to float: 

我无法弄清楚,字符串在哪里!自从使用Pandas读取 post_df1.csv 后,返回:

post_df1.dtypes

uid    float64
iid    float64
rat    float64
dtype: object

问题

  1. 使用此软件包阅读时可能会将整个数据视为字符串?
  2. 我在Error中注意到,float在数据集.py中有一个偏移量和时间戳作为返回值。我怎样才能将它限制在 uid,iid,rat / float 之外?
  3.   

    返回uid,iid,float(r)+ self.offset,timestamp    3.列出项目

    参考

    Suprise Package Docs

    编辑#1

    所以,这里是 post_df1 &amp; post_df2 成立。同样对于 post_df1 ,我尝试从第1行开始获取值,以防第0行是标题。

    # PRE PROCESSED CLUSTER 0 -- Named to POST DataFrame1
    if flag1 is 1:
        print pre_df01
        post_df1 = pre_df01.iloc[1:, :]
    elif flag1 is 2:
        print pre_df02
        post_df1 = pre_df02.iloc[1:, :]
    elif flag1 is 3:
        print pre_df03
        post_df1 = pre_df03.iloc[1:, :]
    
    # PRE PROCESSED CLUSTER 1 -- Named to POST DataFrame2
    if flag2 is 1:
        print pre_df11
        post_df2 = pre_df11
    elif flag2 is 2:
        print pre_df12
        post_df2 = pre_df12
    elif flag2 is 3:
        print pre_df13
        post_df2 = pre_df13
    

    在这里,我已经尝试删除标题和索引以避免其中的任何字符串类型。

    # EXPORT TO CSV & LOAD AGAIN IN PROGRAM
    post_df1.to_csv("post_df1.csv", sep='\t', index=False, header=False)
    post_df2.to_csv("post_df2.csv", sep='\t', index=False, header=False)
    

    由于导入是代码中的问题,我使用Spreadsheet查看了csv文件,这是它的外观 post_df1.csv in LibreOffice Calc 显然它没有标题。

1 个答案:

答案 0 :(得分:2)

由于post_df1.csv中每列的标题(字符串格式),似乎出现此错误。当您从csv文件中删除包含列名的第一行时,您的代码片段应该正常工作。