I have two files.
tree_0, in this form:
443457316403167232 823615 Tue Mar 11 18:43:57 +0000 2014 2
452918771813203968 26558552 Tue Mar 11 21:10:17 +0000 2014 0
443344824096538625 375391930 Tue Mar 11 11:16:57 +0000 2014 9
452924891285581824 478500516 Tue Mar 11 11:38:14 +0000 2014 0
trees.json:
{"reply": 0, "id": 452918771813203968, "children": [{"reply": 0, "id": 452924891285581824, "children": []}]}
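Now I have to go through trees.json and look up each id in tree_0; if it exists, I have to perform some task.
I loaded tree_0 with readlines(). Both files are very large (about 10 GB each). I wrote the code below, but I would like to know whether it is fine or whether it can be made more efficient, because for every id it scans the whole of tree_0 (the while loop).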
import json
import sys
sys.setrecursionlimit(2000)
fr=open('tree_0','r')
lines=fr.readlines()
l=len(lines)

# to find children of trees, this works fine
def get_children(node):
    stack = [node]
    while stack:
        node = stack.pop()
        stack.extend(node['children'][::-1])
        yield node

f = open('trees.json','r')
linenum=0
for line in f:
    d = json.loads(line)
    child_dic={}
    if (linenum<1000):
        for child in get_children(d):
            if child not in child_dic.keys():
                i=0
                while (i< l):  # check whether this makes it slow, as my files are large
                    data=lines[i].split('\t')
                    # search for id in the tree_0 file
                    if str(child["id"])==str(data[0]):
                        print("Perform some task here")
                    i=i+1
Answer 0 (score: 2)
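I think you are doing a lot of unnecessary and inefficient work here. First, since you only need the IDs, you do not have to keep the whole tree_0 file in memory. Instead of iterating over all the lines and extracting the ID every time, do that just once, when loading the file. Also, you can store the IDs in a set, so that each lookup takes roughly constant time on average instead of a scan over every line.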
with open('tree_0') as f:
    all_ids = set(int(line.split('\t')[0]) for line in f)
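With a file of this size, a single blank or malformed line would make int() raise and abort the load. A minimal sketch of a more forgiving loader follows; load_ids is a hypothetical helper name, and the sample lines assume tab-separated fields as in the question's code.

```python
def load_ids(lines):
    """Collect the first tab-separated field of each line as an int."""
    ids = set()
    for line in lines:
        field = line.split('\t', 1)[0]
        try:
            ids.add(int(field))
        except ValueError:
            # Blank or malformed line: skip it instead of aborting the load.
            continue
    return ids

# Sample rows taken from the question, plus one blank line.
sample = [
    "443457316403167232\t823615\tTue Mar 11 18:43:57 +0000 2014\t2\n",
    "452918771813203968\t26558552\tTue Mar 11 21:10:17 +0000 2014\t0\n",
    "\n",
]
ids = load_ids(sample)
print(452918771813203968 in ids)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the file is still read one line at a time.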
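If you also need the other fields from tree_0, you can make it a dictionary that maps each ID to the other fields. Lookups are still much faster than looping over the list each time, at the cost of keeping those fields in memory.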
with open('tree_0') as f:
    all_ids = dict((int(items[0]), items) for items in (line.split('\t') for line in f))
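To get a feel for the difference, here is a rough timing sketch using the standard timeit module. The sizes and names (ids_list, ids_dict, target) are made up for illustration, and the absolute numbers will vary by machine; only the gap between the two matters.

```python
import timeit

# Hypothetical data: 20,000 integer IDs, once as a list and once as dict keys.
ids_list = list(range(20000))
ids_dict = dict((i, None) for i in ids_list)
target = 19999  # worst case for the list scan: the last element

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: target in ids_list, number=200)
dict_time = timeit.timeit(lambda: target in ids_dict, number=200)
print("list: %.6fs, dict: %.6fs" % (list_time, dict_time))
```

The list test has to compare against every element before it finds the target, while the dict test hashes the key and jumps to it, which is why the gap grows with the number of lines.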
With this change, the rest of your code boils down to:
with open('trees.json') as f:
    for line in f:
        d = json.loads(line)
        for child in get_children(d):
            if child["id"] in all_ids:
                # optional: get other stuff from dict
                # other_stuff = all_ids[child["id"]]
                print("Perform some task here")
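Putting the pieces together, here is a small end-to-end sketch that runs on in-memory stand-ins for the two files, so it can be tried without the real data. The io.StringIO objects replace open(...), the rows are tab-separated as the question's code assumes, and the child with id 1 is an invented node that has no match in tree_0.

```python
import io
import json

def get_children(node):
    # Iterative pre-order traversal, as in the question.
    stack = [node]
    while stack:
        node = stack.pop()
        stack.extend(node['children'][::-1])
        yield node

# In-memory stand-ins for the two files.
tree_0 = io.StringIO(
    "452918771813203968\t26558552\tTue Mar 11 21:10:17 +0000 2014\t0\n"
    "452924891285581824\t478500516\tTue Mar 11 11:38:14 +0000 2014\t0\n"
)
trees_json = io.StringIO(
    '{"reply": 0, "id": 452918771813203968, "children": ['
    '{"reply": 0, "id": 452924891285581824, "children": []}, '
    '{"reply": 0, "id": 1, "children": []}]}\n'
)

# One pass over tree_0 to build the lookup set.
all_ids = set(int(line.split('\t')[0]) for line in tree_0)

# One pass over trees.json; each membership test is a hash lookup.
matched = []
for line in trees_json:
    tree = json.loads(line)
    for child in get_children(tree):
        if child["id"] in all_ids:
            matched.append(child["id"])

print(matched)
```

Each file is read exactly once, so the total work is proportional to the number of lines plus the number of tree nodes, rather than their product.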
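Update: if the IDs in tree_0 are not unique, i.e. if you have several lines with the same ID, you can use, for example, a defaultdict that maps each ID to a list of the other attributes, like this: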
import collections

with open('tree_0') as f:
    all_ids = collections.defaultdict(list)
    for line in f:
        items = line.split('\t')
        all_ids[int(items[0])].append(items)
Then, in the other part of your code, just perform the task for every entry in the list:
if child["id"] in all_ids:
    for other_stuff in all_ids[child["id"]]:
        print("Perform some task here", other_stuff)
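As a quick illustration of the grouping, here is a sketch with made-up rows in which one ID appears twice. The second row is invented for the example, and rows_by_id is just an illustrative name for what the answer calls all_ids.

```python
import collections

# Hypothetical rows: the first two share an ID to simulate duplicates.
lines = [
    "452918771813203968\t26558552\tTue Mar 11 21:10:17 +0000 2014\t0\n",
    "452918771813203968\t823615\tTue Mar 11 18:43:57 +0000 2014\t2\n",
    "443344824096538625\t375391930\tTue Mar 11 11:16:57 +0000 2014\t9\n",
]

rows_by_id = collections.defaultdict(list)
for line in lines:
    items = line.rstrip('\n').split('\t')
    rows_by_id[int(items[0])].append(items)

# Check membership with "in" before indexing.
wanted = 452918771813203968
if wanted in rows_by_id:
    for other_stuff in rows_by_id[wanted]:
        print("Perform some task here", other_stuff)
```

One thing to keep in mind: indexing a defaultdict with a missing key silently creates an empty entry, so testing with `in` (or using `.get`) first avoids growing the dictionary with every ID that is not found.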