I have many JSON files that I need to merge and output as a single CSV (to load into R); each JSON file is roughly 1.5 GB. In a trial run on 4-5 JSON files of 250 MB each, I get the error below. I am running Python '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on Windows 7 Professional 64-bit with 8 GB of RAM.
I am a Python novice with little experience writing optimized code, and I would appreciate guidance on how to optimize my script. Thanks!
======= Python MemoryError =======
Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]
======= JSON-to-CSV conversion script =======
import json              # needed for json.loads below (not shown in the original post)
from csv import writer   # needed for writer(out) below (not shown in the original post)

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# (the initialization of tweets, ids, texts, etc. is elided in the post)

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
    ... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"

rows = zip(ids, texts, time_created, retweet_counts, in_reply_to_screen_name, geos, coordinates, places, places_country, lang, user_screen_names, user_followers_count, user_friends_count, user_statuses_count, user_locations)

csv = writer(out)
for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()
Answer 0 (score: 3)
This line right here:
open_files = map(open, filenames)
opens every file at the same time.
Then you read all of the contents and append everything into one and the same list, tweets.
You have two main for loops, so each tweet (and there are several GB worth of them) gets iterated over 4 times! That is because you also pass the per-field lists into the zip function and then iterate over the result to write the file. Any one of these points could be the cause of the memory error.
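For context: in Python 2, zip() builds the whole list of row tuples in memory on top of the per-field lists, so the data is duplicated yet again, while itertools.izip at least makes the pairing step lazy. A minimal sketch of the difference, with toy data:
from itertools import izip

ids = ["1", "2"]
texts = ["a", "b"]

rows_list = zip(ids, texts)   # builds the full list [('1', 'a'), ('2', 'b')] up front
rows_iter = izip(ids, texts)  # an iterator that yields one tuple at a time

for row in rows_iter:
    print row
Still, the real fix is to avoid collecting everything into lists in the first place.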
Unless it is absolutely necessary, try to touch each piece of data only once. As you iterate over a file, process each line and write it out immediately.
Try something like this:
out = open("output.csv", "ab")
filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
def process_tweet_into_line(line):
# load as json, process turn into a csv and return
return line
# change argument to the file you want to open
for name in file_names:
with open(name) as file:
for line in file:
# only keep tweets and not the empty lines
if line.rstrip():
try:
tweet = process_tweet_into_line(line)
out.write(line)
except:
pass
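As a rough illustration of how process_tweet_into_line could be fleshed out for your data: the sketch below assumes you want the same tweet attributes as in your question (only the first few fields are shown; the rest follow the same pattern), and the helper name tweet_to_row is just illustrative. Letting the csv module do the writing keeps the quoting correct:
import json
from csv import writer

out = open("output.csv", "wb")
csv_out = writer(out)

# header row matching the fields extracted below
csv_out.writerow(["ids", "text", "time_created", "retweet_counts"])

def tweet_to_row(tweet):
    # pull out the wanted fields and utf-8 encode any unicode text
    row = [tweet["id_str"], tweet["text"], tweet["created_at"], tweet["retweet_count"]]
    return [(v.encode("utf8") if hasattr(v, "encode") else v) for v in row]

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
for name in filenames:
    with open(name) as f:
        for line in f:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = json.loads(line)
                except ValueError:  # skip lines that are not valid JSON
                    continue
                csv_out.writerow(tweet_to_row(tweet))

out.close()
This way each tweet is parsed, converted to a row, and written out before the next one is read, so only one line is held in memory at a time.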