I have many JSON files that I need to merge and output as a single CSV (to load into R); each JSON file is roughly 1.5 GB. In a trial run on 4-5 JSON files of 250 MB each, I get the error below. I am running Python '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on Windows 7 Professional 64-bit with 8 GB of RAM.
I am a Python novice with little experience writing optimized code, and I would appreciate guidance on how to optimize my script. Thanks!
======= Python MemoryError =======
Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]
======= JSON-to-CSV conversion script =======
import json              # needed for json.loads below (not shown in the original post)
from csv import writer   # needed for writer(out) below (not shown in the original post)

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# (the initialization of tweets, ids, texts, etc. is elided in the post)

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
    ... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"

rows = zip(ids, texts, time_created, retweet_counts, in_reply_to_screen_name, geos, coordinates, places, places_country, lang, user_screen_names, user_followers_count, user_friends_count, user_statuses_count, user_locations)

csv = writer(out)
for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()
Answer 0 (score: 3)
This line right here:
open_files = map(open, filenames)
opens every file at the same time.
Then you read all of the contents and append everything into one and the same list, tweets.
You have two main for loops, so each tweet (and there are several GB worth of them) gets iterated over 4 times! That is because you also pass the per-field lists into the zip function and then iterate over the result to write the file. Any one of these points could be the cause of the memory error.
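For context: in Python 2, zip() builds the whole list of row tuples in memory on top of the per-field lists, so the data is duplicated yet again, while itertools.izip at least makes the pairing step lazy. A minimal sketch of the difference, with toy data:
from itertools import izip

ids = ["1", "2"]
texts = ["a", "b"]

rows_list = zip(ids, texts)   # builds the full list [('1', 'a'), ('2', 'b')] up front
rows_iter = izip(ids, texts)  # an iterator that yields one tuple at a time

for row in rows_iter:
    print row
Still, the real fix is to avoid collecting everything into lists in the first place.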
Unless it is absolutely necessary, try to touch each piece of data only once. As you iterate over a file, process each line and write it out immediately.
Try something like this:
out = open("output.csv", "ab")
filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
def process_tweet_into_line(line):
# load as json, process turn into a csv and return
return line
# change argument to the file you want to open
for name in file_names:
with open(name) as file:
for line in file:
# only keep tweets and not the empty lines
if line.rstrip():
try:
tweet = process_tweet_into_line(line)
out.write(line)
except:
pass
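As a rough illustration of how process_tweet_into_line could be fleshed out for your data: the sketch below assumes you want the same tweet attributes as in your question (only the first few fields are shown; the rest follow the same pattern), and the helper name tweet_to_row is just illustrative. Letting the csv module do the writing keeps the quoting correct:
import json
from csv import writer

out = open("output.csv", "wb")
csv_out = writer(out)

# header row matching the fields extracted below
csv_out.writerow(["ids", "text", "time_created", "retweet_counts"])

def tweet_to_row(tweet):
    # pull out the wanted fields and utf-8 encode any unicode text
    row = [tweet["id_str"], tweet["text"], tweet["created_at"], tweet["retweet_count"]]
    return [(v.encode("utf8") if hasattr(v, "encode") else v) for v in row]

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
for name in filenames:
    with open(name) as f:
        for line in f:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = json.loads(line)
                except ValueError:  # skip lines that are not valid JSON
                    continue
                csv_out.writerow(tweet_to_row(tweet))

out.close()
This way each tweet is parsed, converted to a row, and written out before the next one is read, so only one line is held in memory at a time.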