Can someone tell me what is wrong with reading a CSV file in pandas?

Time: 2019-04-07 10:45:53

Tags: python pandas csv dataframe

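I'm using a script called parser.py to convert JSON data into a CSV file, and another script called Analyzer.py to run counts on it. The CSV output looks correct, but when I read the file in analyzer.py the DataFrame contains extra, corrupted rows: their values do not line up with the columns and most of them are empty. Sorry for my bad English, I'm really struggling with this :(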

CSV columns:

 status_id, created_at, user_id, user_screen_name, status_text, hashtags,
 user_metions, url, status_fav_count, status_rewtweet_count, is_retweet, 
 ori_status_id, ori_creted_at, ori_user_id, ori_user_screen_name, 
 ori_text, ori_hashtags, ori_user_metions, ori_urls, ori_fav_count,
 ori_rewtweet_count, is_quoted, quoted_status_id, quoted_status_creted_at,
 quoted_status_user_id, quoted_status_screen_name, quoted_text,
 quoted_hashtags, quoted_user_metions, quoted_urls, quoted_fav_count, 
 quoted_rewtweet_count

Example row:

  

1106517910707679235,2019-03-15 11:30:02,19888170,Cout_ma,RT @Marish_: kkkkkkkkkkkkkkkkkkkkkkkk,,0,0,True,1106516443468845061,Fri Mar 15 11:24:12 +0000 2019,61990620,Marish_,kkkkkkkkkkkkkkkkkkkkkkkk,,,,5,6,True,1106513884314324992,Fri Mar 15 11:14:02 +0000 2019,14594813,Folha,Ruralistas reclamam de viés anti-China no governo Bolsonaro,,,160,34

Output when running the read test:

  

Pandas(Index=39498, status_id='URL HERE', created_at=nan, user_id=nan, user_screen_name='URL HERE', status_text='0', hashtags='0', user_metions='False', url=nan, status_fav_count=nan, status_rewtweet_count=nan, is_retweet=nan, ori_status_id=nan, ori_creted_at=nan, ori_user_id=nan, ori_user_screen_name=nan, ori_text=nan, ori_hashtags=nan, ori_user_metions='False', ori_urls=nan, ori_fav_count=nan, ori_rewtweet_count=nan, is_quoted=nan, quoted_status_id=nan, quoted_status_creted_at=nan, quoted_status_user_id=nan, quoted_status_screen_name=nan, quoted_text=nan, quoted_hashtags=nan, quoted_user_metions=nan, quoted_urls=nan, quoted_fav_count=nan, quoted_rewtweet_count=nan)

Code that writes the CSV:

    df = pandas.DataFrame(to_csv,columns=['status_id',
                               'created_at',
                               'user_id',
                               'user_screen_name',
                                   .
                                   .
                                   .
                                    ])
    df = df.sort_values(by='status_id')
    df.to_csv(to + index + '_' + start.strftime('%Y-%m-%d %H:%M:%S') + '.csv',index=False,encoding='utf8')

Code that reads the CSV:

    import pandas as pd

    data = pd.read_csv(path + '/' + name)  # name holds the CSV file name
    for i in data.itertuples():
        print(i)

1 Answer:

Answer 0: (score: 1)

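Make sure the tweet text and the other fields do not contain the CSV delimiter you are using (a comma in this case). Otherwise, when those rows are read, there is no way to tell whether a comma is meant to separate two fields or is just a character in the original text. If the fields are wrapped in quotes, the problem should go away.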
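To illustrate the quoting suggestion, here is a minimal sketch, not the asker's actual pipeline. The sample rows and the three-column layout are made up for the example, and it writes to an in-memory buffer rather than a real file. It assumes the fields are quoted with `quoting=csv.QUOTE_ALL` when writing, so that commas inside the tweet text are not read as separators.

```python
import csv
import io

import pandas as pd

# Hypothetical sample rows: the text fields contain commas on purpose.
rows = [
    {"status_id": 1, "status_text": "hello, world", "status_fav_count": 0},
    {"status_id": 2, "status_text": "one, two, three", "status_fav_count": 5},
]
df = pd.DataFrame(rows, columns=["status_id", "status_text", "status_fav_count"])

# Write with every field wrapped in double quotes.
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)

# Read back; read_csv uses '"' as its quote character by default,
# so commas inside quoted fields stay part of the text.
buffer.seek(0)
restored = pd.read_csv(buffer)

round_trip_ok = (
    restored.shape == df.shape
    and restored["status_text"].tolist() == df["status_text"].tolist()
)
print(round_trip_ok)
```

If a file written this way still comes back with shifted columns, the cause is probably something other than unquoted commas, for example the file being written or modified by another tool.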