使用pandas将JSON转换为CSV

时间:2018-05-28 01:32:40

标签: python json python-3.x pandas csv

我有一个JSON文件,其中包含46k +英文和其他语言的推文,我想保存为csv文件。下面是json文件的一部分。

    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…", "tweet_id": 976166125502427136}
{"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \n\n#PZvQG \n#ABC", "tweet_id": 976166125535973378}
{"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\nWatch ball by ball highlights at (link removed)\n\n#PZvQG #HBLPSL #PSL2018 @_crici…", "tweet_id": 976166126202839040}
{"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
{"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…", "tweet_id": 976166126559354880}
{"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over\n#PZvQG", "tweet_id": 976166126836178944}
{"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG", "tweet_id": 976166126685171713}
{"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\nIt's gonna leave you in pieces\nAll alone with your demons\nAnd I know that we…", "tweet_id": 976166126924201986}]

我按照此solution将其转换为CSV,但在urdu推文上获得了无效的语法错误。 我也试过这个:

    import json
with open("PeshVsQuetta.json") as f:
all_tweets = []
for line in f:
    text_dict = json.loads(line)
    all_tweets.append(text_dict)

print(all_tweets[0]['tweet_content'])

给了我以下错误。

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>

我甚至将json文件保存为txt文件并尝试了这个:

    import pandas as pd
    from ast import literal_eval
    columns = ['Tweet ID','Author ID','Tweet','Time']
    df1 = pd.DataFrame(columns = columns)
    f = open('PeshvsQuetta.txt',encoding = 'utf-8')
    counter = 1
    for line in f:
         if(counter != 1):
             s1 = literal_eval(line)
             ser = pd.Series([s1['tweet_id'],s1['user_id'],s1['tweet_content'],s1["date_time"]],index=['Tweet ID','Author ID','Tweet','Time'])
             df1 = df1.append(ser,ignore_index=True)
    counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8',index=False,columns = columns)

但是生成的csv文件将每个系列保存在一个单元格中,并且它有很多空行,一些推文保存在多行中。下面是图片。

enter image description here 任何帮助都会得到真正的赞赏。

1 个答案:

答案 0 :(得分:0)

您应该能够如下使用Pandas:

import pandas as pd

with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input)

df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

这假设您的JSON文件在开始时包含BOM。对于您上面给出的数据,这将生成以下CSV文件:

date_time,tweet_content,tweet_id,user_id
2018-03-20 18:38:35,RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…,976166125502427136,938118866135343104
2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

#PZvQG 
#ABC",976166125535973378,959235642
2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
Watch ball by ball highlights at (link removed)

#PZvQG #HBLPSL #PSL2018 @_crici…",976166126202839040,395163528
2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…,976166126559354880,3310967346
2018-03-20 18:38:35,"I wanted a super over
#PZvQG",976166126836178944,701494826194354179
2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG,976166126685171713,347132028
2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
It's gonna leave you in pieces
All alone with your demons
And I know that we…",976166126924201986,3461853618

注意:您的某些字段包含换行符,因此输出可能看起来有些奇怪。读取此文件的应用程序将正确处理它(只要您告诉它在导入时编码为UTF-8)