Question

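I'm trying to read, from a separate Python script, a file that is currently being written to at high speed. The file has about 70,000 lines. When I try to read it, I usually get to ~7,750 lines before my application exits.

I think this is because the file is being written to (append only). I have processed larger files (20k lines), but only while nothing was writing to them.

What steps can I take to troubleshoot this further? How can I read this file while it is still being written?

I'm new to Python. Any and all help is appreciated.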

tweets_data = []
tweets_file = open(tweets_data_path, "r")
i = 0
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
        i += 1
        if i % 250 == 0:
            print i
    except:
        continue

## Total # of tweets captured
print len(tweets_data)

Python 2.7
Ubuntu 14.04

Traceback: I get this on every read

    ValueError: No JSON object could be decoded
    Traceback (most recent call last):
       File "data-parser.py", line 33, in <module>
         tweet = json.loads(line)
       File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
         return _default_decoder.decode(s)
       File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
       File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
         raise ValueError("No JSON object could be decoded")

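Update

I modified my code to follow the suggestions from @JanVlcinsky. I have determined that the problem is not that the file is being written to. In the code below, if I comment out tweets_data.append(tweet), or if I add a condition so that tweets are only added to the array every so often, my program works as expected. However, if I try to read in all ~90,000 lines, my application exits prematurely.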

    tweets_data = []
    with open(tweets_data_path) as f:
        for i, line in enumerate(f):
            if i % 1000 == 0:
                print "line check: ", str(i)
            try:
                ## Skip "newline" entries
                if i % 2 == 1:
                    continue
                ## Load tweets into array
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except Exception as e:
                print e
                continue

    ## Total # of tweets captured
    print "decoded tweets: ", len(tweets_data)
    print str(tweets_data[0]['text'])

Premature exit output:

When loading every valid line into the array...

...
line check:  41000
line check:  42000
line check:  43000
line check:  44000
line check:  45000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every other valid line into the array...

...
line check:  86000
line check:  87000
line check:  88000
dannyb@twitter-data-mining:/var/www/cmd$

When loading every third valid line into the array...

...
line check:  98000
line check:  99000
line check:  100000
line check:  101000
decoded tweets:  16986

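This ultimately leads me to believe the problem is related to the size of the array and my available resources? (This is on a VPS with 1GB of RAM.)

FINAL: Doubling the RAM fixed the problem. It appears my Python script was exceeding the amount of RAM available. As a follow-up, I have started looking into ways to make the script more memory-efficient, as well as ways to increase the total amount of RAM available to it.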

Answer 1

I think your plan to read tweets from a continuously appended file should work.
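As a rough illustration of reading a file while another process appends to it, here is a minimal sketch. The helper names read_complete_lines and follow, and the poll_seconds and max_idle_polls parameters, are hypothetical and not from the question. The sketch assumes one JSON object per line and holds back a trailing fragment that has no newline yet, since the writer may be halfway through it.

```python
import json
import time


def read_complete_lines(f, leftover=b""):
    """Read whatever is currently available from the binary file f.

    Returns (complete_lines, leftover). leftover is a trailing fragment
    without a newline, i.e. a line the writer has not finished yet.
    """
    chunk = leftover + f.read()
    lines = chunk.split(b"\n")
    # The last element is empty if the chunk ended with a newline;
    # otherwise it is an unfinished line to retry on the next call.
    leftover = lines.pop()
    return lines, leftover


def follow(path, poll_seconds=1.0, max_idle_polls=5):
    """Yield decoded JSON objects from a file that is still growing.

    Stops after max_idle_polls consecutive polls with no new complete
    line (an arbitrary stopping rule chosen for this sketch).
    """
    leftover = b""
    idle = 0
    with open(path, "rb") as f:
        while idle < max_idle_polls:
            lines, leftover = read_complete_lines(f, leftover)
            if not lines:
                idle += 1
                time.sleep(poll_seconds)
                continue
            idle = 0
            for line in lines:
                line = line.strip()
                if line:  # skip blank separator lines
                    yield json.loads(line)
```

Reading in binary mode avoids decoding a multi-byte character that has only been partly flushed; json.loads accepts bytes on Python 3.6+ and on Python 2.7, where bytes is str.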

You may find some surprises in your code, though.

Modify your code as follows:

import json
tweets_data = []
with open("tweets.txt") as f:
    for i, line in enumerate(f):
        if i % 250 == 0:
            print i
        line = line.strip()
        # skipping empty lines
        if not len(line):
            continue
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except MemoryError as e:
            print "Run out of memory, bye."
            raise e
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

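Changes:

with open...: this is simply a good habit; it makes sure the file gets closed whatever happens after it is opened.
for i, line in enumerate(f): enumerate yields an increasing number for each item as it iterates over f.
Moved the print for every 250th line to the front. This may reveal that you really are reading many lines, but that too many of them are not valid JSON objects. When the print is placed after json.loads, lines that fail to decode are never counted.
except Exception as e: catching any exception the way you did before is a bad habit, because valuable information about the problem stays hidden from you. In a real run you will see the exceptions printed, which will help you understand the problem.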

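EDIT: Added skipping of empty lines (rather than relying on empty lines occurring at regular intervals).

Also added an explicit catch for MemoryError, just in case we run out of RAM.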
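The MemoryError handler only helps if Python itself raises the exception. If the process dies without reaching the handler, logging peak memory next to the progress counter would at least show whether memory keeps growing until the exit. A minimal sketch using the standard resource module (Unix only); the helper names peak_memory_mb and progress_line are hypothetical:

```python
import resource
import sys


def peak_memory_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


def progress_line(i):
    """Format a progress message that includes peak memory so far."""
    return "line check: %d, peak memory: %.1f MB" % (i, peak_memory_mb())
```

In the loop above, the progress print could then become print(progress_line(i)) every 250 lines.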

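EDIT2: Rewritten to use a list comprehension (not sure whether this reduces RAM usage). It assumes that all non-empty lines are valid JSON strings, and it does not print any progress reports: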

import json
with open("tweets.txt") as f:
    tweets_data = [json.loads(line)
                   for line in f
                   if len(line.strip())]

## Total # of tweets captured
print "decoded tweets", len(tweets_data)

Since there is no append operation, it may run faster than the previous version.
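If RAM rather than speed is the limit, one option is to avoid holding every decoded tweet at once. The sketch below is an assumption about what might help, not something measured in this thread: it yields tweets one at a time and keeps only the 'text' field, which is the field the question prints. The names iter_tweets and collect_texts are hypothetical.

```python
import json


def iter_tweets(path):
    """Yield one decoded tweet at a time instead of building a list."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank separator lines
            try:
                yield json.loads(line)
            except ValueError as e:
                # json raises ValueError (or a subclass) on malformed input
                print("skipping undecodable line: %s" % e)


def collect_texts(path):
    """Keep only the 'text' field of each tweet, not the whole object."""
    return [tweet["text"]
            for tweet in iter_tweets(path)
            if isinstance(tweet, dict) and "text" in tweet]
```

Each full tweet object can be garbage-collected as soon as its text has been extracted, so the list holds only short strings.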

Python script exits without logging an error when building a large number of objects
