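I am using tweepy's StreamListener to collect Twitter data, and the code I am using produces JSONL files with a lot of metadata.
Now I want to convert the file to CSV, and I found some code for that. Unfortunately, I get an error that reads: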
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 7833)
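I have read other threads, and I think it is related to json.loads not being able to handle multiple JSON documents in one file (which is, of course, the case with my JSONL file).
How can I work around this in the code? Or do I have to use a completely different approach to convert the file? (I am using Python 3.6, and the tweets I am streaming are mostly in Arabic.)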
__author__ = 'seandolinar'
import json
import csv
import io
'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''
data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)
csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')
for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]
    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')
csv_out.close()
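Answer 0 (score: 1)
If the data file consists of multiple lines, each of which is a single JSON object, you can use a generator to decode one line at a time.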
def extract_json(fileobj):
    # Using "with" ensures that fileobj is closed when we finish reading it.
    with fileobj:
        for line in fileobj:
            yield json.loads(line)
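To see why decoding the whole file fails while decoding line by line works, here is a minimal sketch. The two-line sample string is made up for illustration and only stands in for the contents of a real JSONL file.

```python
import io
import json

# Two JSON objects, one per line, as in a JSONL file (made-up sample data).
sample = u'{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n'

# Parsing the whole string at once fails, because a second JSON document
# follows the first one.
try:
    json.loads(sample)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg

# Parsing line by line works, because each line is a complete document.
records = [json.loads(line) for line in io.StringIO(sample)]

print(whole_file_error)
print([record["text"] for record in records])
```

The first call raises the same kind of "Extra data" error as in the question, while the per-line loop yields one dictionary per tweet.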
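The only change to your code is that the data_json file is not read explicitly, and data_python is the result of calling extract_json rather than json.loads. Here is the modified code: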
import json
import csv
import io
'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''
def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines. Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            yield json.loads(line)
data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8') # opens the JSONL file without reading it
data_python = extract_json(data_json)
csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')
for line in data_python:
    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    #str() is used instead of unicode(), which does not exist in Python 3
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           str(line.get('user').get('followers_count')),
           str(line.get('user').get('friends_count')),
           str(line.get('retweet_count')),
           str(line.get('favorite_count'))]
    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')
csv_out.close()
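As a possible variation, the csv module can take care of the quoting instead of building each row by hand. The sketch below is not part of the original script: jsonl_to_csv is a hypothetical helper name, and it assumes that blank lines and lines without a "text" key (for example, non-tweet messages in the stream) can simply be skipped.

```python
import csv
import io
import json


def jsonl_to_csv(in_path, out_path):
    """Convert a JSONL file of tweets to CSV and return the number of rows written."""
    header = ['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav']
    count = 0
    with io.open(in_path, mode='r', encoding='utf-8') as src, \
            io.open(out_path, mode='w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst)  # handles commas, quotes and newlines in fields
        writer.writerow(header)
        for line in src:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines carry no data
            tweet = json.loads(line)
            if 'text' not in tweet:
                continue  # assumption: lines without text are not tweets
            user = tweet.get('user') or {}
            writer.writerow([
                tweet.get('created_at'),
                tweet.get('text'),
                user.get('screen_name'),
                user.get('followers_count'),
                user.get('friends_count'),
                tweet.get('retweet_count'),
                tweet.get('favorite_count'),
            ])
            count += 1
    return count
```

With the file names from the question, it would be called as jsonl_to_csv('stream_____.jsonl', 'tweets_out_utf8.csv'). Opening the output with newline='' follows the csv module's recommendation, so that embedded line breaks in tweet text are written correctly.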