I have a 265 MB file with 7,504,527 lines.
This loop takes a huge amount of time:
from collections import namedtuple

B = namedtuple("B", ["id", "ch"])

def get_tuple_from_b_string(b_str):
    return B(int(b_str.split("_")[0]), int(b_str.split("_")[1]))

with open("/tmp/file.out") as test_out:
    for line in test_out:
        if line == '' or not line.startswith("["):
            continue
        bracket_index = line.find(']')
        b_str = line[1:bracket_index]
        content = line[bracket_index+1:]
        b_tuple = get_tuple_from_b_string(b_str)
        if not b_tuple in b_tupled_list:
            continue
        if not b_tuple in b_outputs:
            b_outputs[b_tuple] = ''
        b_outputs[b_tuple] += content + '\n'
I am running it now, but after 19:38 minutes it still has not finished. I checked the strace of the process and these lines keep repeating:
mmap2(NULL,3145728,PROT_READ | PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS, -1,0xff9093fc)= 0xfffffffff4997000
munmap(0xf678b000,31452728)
but with different addresses (so I don't think it is stuck, it is still reading). My question is:
Example of the file contents:
[1_03]{
[1_03] "0": {
[1_03] "interfaces": {
[1_03] "address": [],
[1_03] "distribution": [],
[1_03] "interface": [],
[1_03] "netmask": []
[1_03] },
Answer 0 (score: 0)
A few performance optimizations:
def get_tuple_from_b_string(b_str):
    # do not split the string twice; map and unpack the split list at once
    # (maxsplit=1, because B has exactly two fields):
    return B(*map(int, b_str.split("_", 1)))

with open("/tmp/file.out") as test_out:
    for line in test_out:
        # omit the superfluous empty-string test and reverse the if logic
        # to remove the 'continue'
        # (probably more a matter of style than performance though...)
        if line.startswith("["):
            # replace find() plus two slices with one slice and one split;
            # maxsplit=1 so that ']' characters inside the content survive:
            b_str, content = line[1:].split("]", 1)
            b_tuple = get_tuple_from_b_string(b_str)
            # again reversing the if logic to omit the 'continue'.
            # maybe this could be sped up using a set, depending on what you're doing...
            if b_tuple in b_tupled_list:
                # 'a not in b' is clearer than 'not a in b'
                if b_tuple not in b_outputs:
                    # omit the separate empty-string assignment (string concatenation is slow!)
                    b_outputs[b_tuple] = content + '\n'
                else:
                    b_outputs[b_tuple] += content + '\n'
With these optimizations you should get somewhat faster. In particular, your string operations are probably the most expensive part: strings are immutable, so every attempt to modify one creates a copy of it. That costs time and memory and should be avoided in performance-critical scripts.
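As a rough illustration of the two remaining ideas (a set for the membership test, and collecting pieces in a list instead of repeated string concatenation), here is a minimal sketch. It assumes b_tupled_list and b_outputs mean the same as in the question; the names wanted and chunks are only illustrative:

from collections import namedtuple, defaultdict

B = namedtuple("B", ["id", "ch"])

def get_tuple_from_b_string(b_str):
    return B(*map(int, b_str.split("_", 1)))

# placeholder: the keys you actually want to keep, as in the question
b_tupled_list = [B(1, 3)]

# membership tests against a set are O(1) instead of O(n) for a list
wanted = set(b_tupled_list)

# collect the pieces per key in a list and join once at the end,
# instead of growing a string with '+=' inside the loop
chunks = defaultdict(list)
with open("/tmp/file.out") as test_out:
    for line in test_out:
        if line.startswith("["):
            b_str, content = line[1:].split("]", 1)
            b_tuple = get_tuple_from_b_string(b_str)
            if b_tuple in wanted:
                chunks[b_tuple].append(content)

# same shape as b_outputs: one string per key
# (content already ends with the line's newline; append an extra '\n'
#  per piece if you need the exact output of the original code)
b_outputs = {key: "".join(parts) for key, parts in chunks.items()}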
Answer 1 (score: 0)
What you have is a large number of separate JSON documents, each line tagged with a non-JSON prefix. The first thing I would do is use awk to split the file into multiple valid JSON files named after the prefix, e.g. 1_03.json. Then I would read them all into Python with a fast JSON parser such as ujson (https://pypi.python.org/pypi/ujson). Overall this should be considerably faster.
The awk script would look roughly like this (only lightly tested):
BEGIN {
    FS = "]";
}
{
    filename = substr($1, 2) ".json";   # e.g. 1_03.json
    # print everything after the first "]" so that "]" characters inside
    # the JSON itself are preserved (clearing $1 and printing $0 would
    # rebuild the record and drop them):
    print substr($0, length($1) + 2) > filename;
}
Then it is just a matter of glob-ing the resulting files in Python and parsing each one with ujson.
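A minimal sketch of that second step, assuming the awk script above was saved as, say, split.awk, run with awk -f split.awk /tmp/file.out, and produced one valid JSON document per prefix in the current directory (ujson is a separate install; the standard json module works as a slower drop-in):

import glob
import os

import ujson  # pip install ujson

docs = {}
for path in glob.glob("*.json"):
    # "1_03.json" -> "1_03", the prefix the awk script stripped off
    prefix = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        docs[prefix] = ujson.load(f)

# docs now maps each prefix (e.g. "1_03") to its parsed JSON document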