I have a 265 MB file with 7,504,527 lines.
This loop takes a huge amount of time:
from collections import namedtuple

B = namedtuple("B", ["id", "ch"])

def get_tuple_from_b_string(b_str):
    return B(int(b_str.split("_")[0]), int(b_str.split("_")[1]))

with open("/tmp/file.out") as test_out:
    for line in test_out:
        if line == '' or not line.startswith("["):
            continue
        bracket_index = line.find(']')
        b_str = line[1:bracket_index]
        content = line[bracket_index+1:]
        b_tuple = get_tuple_from_b_string(b_str)
        if not b_tuple in b_tupled_list:
            continue
        if not b_tuple in b_outputs:
            b_outputs[b_tuple] = ''
        b_outputs[b_tuple] += content + '\n'
I am running it now, but after 19:38 minutes it still has not finished. I checked the strace of the process and these lines keep repeating:
mmap2(NULL,3145728,PROT_READ | PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS, -1,0xff9093fc)= 0xfffffffff4997000
munmap(0xf678b000,31452728)
but with different addresses (so I don't think it is stuck, it is still reading). My question is:
Example of the file contents:
[1_03]{
[1_03] "0": {
[1_03] "interfaces": {
[1_03] "address": [],
[1_03] "distribution": [],
[1_03] "interface": [],
[1_03] "netmask": []
[1_03] },
Answer 0 (score: 0)
A few performance optimizations:
def get_tuple_from_b_string(b_str):
    # do not split the string twice; map and unpack the split list at once
    # (maxsplit=1, because B has exactly two fields):
    return B(*map(int, b_str.split("_", 1)))

with open("/tmp/file.out") as test_out:
    for line in test_out:
        # omit the superfluous empty-string test and reverse the if logic
        # to remove the 'continue'
        # (probably more a matter of style than performance though...)
        if line.startswith("["):
            # replace find() plus two slices with one slice and one split;
            # maxsplit=1 so that ']' characters inside the content survive:
            b_str, content = line[1:].split("]", 1)
            b_tuple = get_tuple_from_b_string(b_str)
            # again reversing the if logic to omit the 'continue'.
            # maybe this could be sped up using a set, depending on what you're doing...
            if b_tuple in b_tupled_list:
                # 'a not in b' is clearer than 'not a in b'
                if b_tuple not in b_outputs:
                    # omit the separate empty-string assignment (string concatenation is slow!)
                    b_outputs[b_tuple] = content + '\n'
                else:
                    b_outputs[b_tuple] += content + '\n'
With these optimizations you should get somewhat faster. In particular, your string operations are probably the most expensive part: strings are immutable, so every attempt to modify one creates a copy of it. That costs time and memory and should be avoided in performance-critical scripts.
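As a rough illustration of the two remaining ideas (a set for the membership test, and collecting pieces in a list instead of repeated string concatenation), here is a minimal sketch. It assumes b_tupled_list and b_outputs mean the same as in the question; the names wanted and chunks are only illustrative:

from collections import namedtuple, defaultdict

B = namedtuple("B", ["id", "ch"])

def get_tuple_from_b_string(b_str):
    return B(*map(int, b_str.split("_", 1)))

# placeholder: the keys you actually want to keep, as in the question
b_tupled_list = [B(1, 3)]

# membership tests against a set are O(1) instead of O(n) for a list
wanted = set(b_tupled_list)

# collect the pieces per key in a list and join once at the end,
# instead of growing a string with '+=' inside the loop
chunks = defaultdict(list)
with open("/tmp/file.out") as test_out:
    for line in test_out:
        if line.startswith("["):
            b_str, content = line[1:].split("]", 1)
            b_tuple = get_tuple_from_b_string(b_str)
            if b_tuple in wanted:
                chunks[b_tuple].append(content)

# same shape as b_outputs: one string per key
# (content already ends with the line's newline; append an extra '\n'
#  per piece if you need the exact output of the original code)
b_outputs = {key: "".join(parts) for key, parts in chunks.items()}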
Answer 1 (score: 0)
What you have is a large number of separate JSON documents, each line tagged with a non-JSON prefix. The first thing I would do is use awk to split the file into multiple valid JSON files named after the prefix, e.g. 1_03.json. Then I would read them all into Python with a fast JSON parser such as ujson (https://pypi.python.org/pypi/ujson). Overall this should be considerably faster.
The awk script would look roughly like this (only lightly tested):
BEGIN {
    FS = "]";
}
{
    filename = substr($1, 2) ".json";   # e.g. 1_03.json
    # print everything after the first "]" so that "]" characters inside
    # the JSON itself are preserved (clearing $1 and printing $0 would
    # rebuild the record and drop them):
    print substr($0, length($1) + 2) > filename;
}
Then it is just a matter of glob-ing the resulting files in Python and parsing each one with ujson.
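A minimal sketch of that second step, assuming the awk script above was saved as, say, split.awk, run with awk -f split.awk /tmp/file.out, and produced one valid JSON document per prefix in the current directory (ujson is a separate install; the standard json module works as a slower drop-in):

import glob
import os

import ujson  # pip install ujson

docs = {}
for path in glob.glob("*.json"):
    # "1_03.json" -> "1_03", the prefix the awk script stripped off
    prefix = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        docs[prefix] = ujson.load(f)

# docs now maps each prefix (e.g. "1_03") to its parsed JSON document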