Question

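What is the correct way, in Python 3, to open a large file of tweets that was streamed with tweepy? I have been using the following for smaller files, but now that my collected data exceeds 30 GB I am getting memory errors: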

import json

with open('data.txt') as f:
    # f.read() loads the entire file into memory at once
    tweetStream = f.read().splitlines()

tweet = json.loads(tweetStream[0])
print(tweet['text'])
print(tweet['user']['screen_name'])

So far I have not been able to find what I need online, so any help would be appreciated.

Answer 1

Don't try to create an object that holds the entire file. Since each line contains one tweet, process the file one line at a time instead:

import json

with open('data.txt') as f:
    # Iterating over the file object reads one line at a time
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])
        print(tweet['user']['screen_name'])

You might store the relevant tweets in another file or a database, or produce a statistical summary. For example:

import json

total = 0
about_badgers = 0
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        total += 1
        # Case-insensitive substring match on the tweet text
        if "badger" in tweet['text'].lower():
            about_badgers += 1

print("Of " + str(total) + ", " + str(about_badgers) + " were about badgers.")
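The other suggestion, storing the relevant tweets in another file, could look something like the sketch below. The function name `filter_tweets` and its parameters are made up for illustration, and it assumes one JSON object per line. It uses `.get()` for the text because a stream file may contain records without a `text` field (an assumption about the data, not something the question confirms).

```python
import json


def filter_tweets(src_path, dst_path, keyword):
    """Copy tweets whose text contains `keyword` from src_path to dst_path.

    Only one line is held in memory at a time, so memory use does not
    grow with the size of the input file.
    Returns a tuple (records_parsed, records_kept).
    """
    total = 0
    kept = 0
    keyword = keyword.lower()
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            total += 1
            # .get() avoids a KeyError on records that have no 'text' field
            if keyword in tweet.get("text", "").lower():
                dst.write(line + "\n")
                kept += 1
    return total, kept
```

Calling `filter_tweets('data.txt', 'badgers.txt', 'badger')` would then leave a much smaller file that can be loaded in the usual way.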

Catch errors caused by lines that cannot be parsed, like this:

import json

with open('data.txt') as f:
    for line in f:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Do something useful, like write the failing line to an error log
            continue

        # Only reached when the line parsed successfully
        print(tweet['text'])
        print(tweet['user']['screen_name'])
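As one possible way to "do something useful" with failing lines, the sketch below wraps the loop in a generator that appends unparseable lines to an error log and yields only records that look like tweets. The names `iter_tweets` and `error_log_path` are hypothetical, and the check for `text` and `user` keys is an assumption about what counts as a tweet in this file.

```python
import json


def iter_tweets(path, error_log_path):
    """Yield one parsed tweet (a dict) at a time from a line-delimited file.

    Lines that are not valid JSON are appended to error_log_path together
    with their line number, then skipped.
    """
    with open(path, encoding="utf-8") as f, \
            open(error_log_path, "a", encoding="utf-8") as errors:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Record where and why parsing failed, then move on
                errors.write(f"{line_number}\t{exc}\t{line.rstrip()}\n")
                continue
            # Assumption: a usable tweet has both 'text' and 'user' fields
            if isinstance(record, dict) and "text" in record and "user" in record:
                yield record
```

The caller's loop then stays short: `for tweet in iter_tweets('data.txt', 'errors.log'): print(tweet['text'])`. Because it is a generator, still only one line is held in memory at a time.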

How to open large Twitter files (30GB+) in Python?
