MemoryError in Python: how can I optimize my code?

Asked: 2014-05-15 02:29:44

Tags: python memory iteration

I have a number of JSON files, each around 1.5 GB, that I want to merge and write out as a single CSV (to load into R). On a trial run with 4-5 JSON files of about 250 MB each, I get the error below. I am running Python '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on Windows 7 Professional 64-bit with 8 GB of RAM.

I am new to Python and have little experience writing optimized code, so I would appreciate guidance on how to streamline my script. Thanks!

======= Python MemoryError =======

Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]

======= JSON-to-CSV conversion script =======

import json
from csv import writer

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# one accumulator list per column (the remaining columns are elided below)
tweets = []
ids, texts, time_created, retweet_counts = [], [], [], []

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)

csv = writer(out)

for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)

out.close()

1 Answer:

Answer 0 (score: 3)

This line right here:

open_files = map(open, filenames)

opens every file at the same time.
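As an aside, the standard library's fileinput module already gives you the one-file-at-a-time behaviour you want here; a minimal sketch (the `iter_lines` helper name is my own):

```python
import fileinput

def iter_lines(filenames):
    # fileinput opens the files one after another, closing each before
    # the next is opened, so only a single file handle is ever live --
    # unlike map(open, filenames), which opens them all up front.
    for line in fileinput.input(files=filenames):
        yield line
```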

You then read everything and append it all into the same list, tweets.

You have two main for loops, so every tweet (and there are several GB worth of them) is iterated over four times in total: once while reading the files, once while building the column lists, again inside the zip call, and again while writing the rows out. Any of those points could be the cause of the memory error.
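To see why the zip step alone hurts: in Python 2, zip() returns a complete new list of tuples, i.e. one more full copy of data that is already sitting in the column lists; itertools.izip in 2.x (or plain zip() in Python 3) yields rows lazily instead. A tiny illustration in Python 3 syntax:

```python
ids = ["1", "2"]
texts = ["hello", "world"]

# Python 2's zip() would materialize [("1", "hello"), ("2", "world")]
# as a whole new list. A lazy zip yields one row at a time, so only
# the row currently being written exists in memory.
rows = zip(ids, texts)
first_row = next(iter(rows))
```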

Unless it is absolutely necessary, try to touch each piece of data only once. While iterating over a file, process the line and write it out immediately.

Try something like this:

out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

def process_tweet_into_line(line):
    # load the line as JSON, pull out the wanted fields,
    # and return them as a single CSV line
    return line

# process each file in turn; only one is open at a time
for name in filenames:
    with open(name) as file:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = process_tweet_into_line(line)
                    out.write(tweet)
                except ValueError:
                    # skip lines that are not valid JSON
                    pass

out.close()
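For completeness, here is one way process_tweet_into_line could be filled in, restricted to the first few columns from the question (the remaining fields would follow the same pattern); this is a sketch in Python 3 syntax, and `tweet_to_row` and `convert` are names of my own choosing:

```python
import csv
import json

def tweet_to_row(line):
    # parse one JSON line and keep only the fields the CSV needs,
    # so nothing survives in memory after its row is written
    tweet = json.loads(line)
    return [
        tweet["id_str"],
        tweet["text"],
        tweet["created_at"],
        tweet["retweet_count"],
    ]

def convert(json_lines, out_file):
    # stream: each line goes straight from JSON into the CSV writer
    w = csv.writer(out_file)
    w.writerow(["ids", "text", "time_created", "retweet_counts"])
    for line in json_lines:
        if line.rstrip():
            try:
                w.writerow(tweet_to_row(line))
            except (ValueError, KeyError):
                pass  # skip malformed lines
```

Because convert never holds more than one parsed tweet at a time, peak memory stays flat no matter how large the input files are.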