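I have several thousand values (currently in a list, though I could convert them to a dictionary if that helps) that I want to compare against a file with several million lines. I want to filter the file down to only those lines that start with one of the values in the list.
What is the fastest way to do this?
My slow code: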
    for line in source_file:
        # Go through all IDs
        for id in my_ids:
            if line.startswith(str(id) + "|"):
                # replace commas with semicolons and pipes with commas
                target_file.write(line.replace(",",";").replace("|",","))
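Answer 0 (score: 3)
If you are sure that every line starts with id + "|" and that "|" never appears inside an id, you can take advantage of the "|" delimiter. For example: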
    # Build a set of ID strings once so each membership test is O(1) on average
    my_id_strs = set(map(str, my_ids))
    for line in source_file:
        first_part = line.split("|", 1)[0]
        if first_part in my_id_strs:
            target_file.write(line.replace(",",";").replace("|",","))
Hope this helps :)
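The same idea can be sketched as a small self-contained function. This is only an illustration: the name `lines_with_known_id` and the sample data are assumptions, not part of the answer, and it relies on the same premise that an ID is always followed by "|" and never contains "|" itself.

```python
def lines_with_known_id(my_ids, lines):
    """Yield the lines whose text before the first "|" is one of my_ids."""
    # Convert the IDs to strings once; a set gives average O(1) membership tests.
    id_strs = {str(i) for i in my_ids}
    for line in lines:
        # partition() splits at the first "|" only; sep is "" if there is no "|".
        first_part, sep, _rest = line.partition("|")
        if sep and first_part in id_strs:
            yield line


# Hypothetical sample data standing in for the real file
sample = ["12|a,b|c\n", "123|x|y\n", "no pipe here\n", "7|z\n"]
matches = list(lines_with_known_id([12, 7], sample))
```

Each line is now checked against the whole ID collection with one hash lookup, instead of one `startswith` call per ID.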
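Answer 1 (score: 1)
Use str.translate for the replacements (string.translate with string.maketrans in Python 2). You can also break out of the inner loop once an id has matched.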
    # Python 3: str.maketrans (in Python 2: from string import maketrans)
    trantab = str.maketrans(",|", ";,")
    ids = ['%d|' % id for id in my_ids]
    for line in source_file:
        # Go through all IDs
        for id in ids:
            if line.startswith(id):
                # replace commas with semicolons and pipes with commas
                target_file.write(line.translate(trantab))
                break
Or:
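    # Python 3: str.maketrans (in Python 2: from string import maketrans)
    # replace commas with semicolons and pipes with commas
    trantab = str.maketrans(",|", ";,")
    # Store the IDs as strings so they compare equal to the sliced line prefix
    idset = set(map(str, my_ids))
    for line in source_file:
        try:
            if line[:line.index('|')] in idset:
                target_file.write(line.translate(trantab))
        except ValueError:
            # line contains no '|', so it cannot match
            pass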
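As a quick check of the translation step on its own, the sketch below compares a translation table with the chained `replace` calls from the question. The function names and the sample line are made up for illustration; it assumes Python 3, where `str.maketrans` builds the table.

```python
# One table maps "," -> ";" and "|" -> "," in a single pass over the string.
TABLE = str.maketrans(",|", ";,")


def convert(line):
    """Translate both characters in one pass."""
    return line.translate(TABLE)


def convert_with_replace(line):
    """The question's approach: commas first, then pipes (the order matters)."""
    return line.replace(",", ";").replace("|", ",")


sample = "12|a,b|c\n"  # hypothetical input line
converted = convert(sample)
same_as_replace = converted == convert_with_replace(sample)
```

With a table, the order of the two substitutions no longer matters, because each character is mapped exactly once.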
Answer 2 (score: 0)
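Use a regular expression. We first build and compile a single regular expression from the prefixes (expensive, but done only once); matching each line against it is then very fast.
Test code, which calls filterlines(prefixes, lines):
    with open("/usr/share/dict/words") as words:
        prefixes = [line.strip() for line in words]
    lines = [
        "zoo this should match",
        "000 this shouldn't match",
    ]
    print(list(filterlines(prefixes, lines)))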
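The filterlines function that the test code calls is not reproduced above. A minimal sketch of what such a function might look like follows; the body is an assumption based only on the description (one compiled regular expression built from the prefixes), and it expects at least one non-empty prefix.

```python
import re


def filterlines(prefixes, lines):
    """Yield the lines that start with any of the given prefixes."""
    # Escape each prefix so it is matched literally, then join them into one
    # alternation. Empty prefixes are skipped because they would match every line.
    pattern = "|".join(re.escape(p) for p in prefixes if p)
    regex = re.compile(pattern)
    for line in lines:
        # re.match only matches at the start of the string, so this is a prefix test.
        if regex.match(line):
            yield line


# Small hypothetical inputs instead of /usr/share/dict/words
prefixes = ["zoo", "apple"]
lines = ["zoo this should match", "000 this shouldn't match"]
result = list(filterlines(prefixes, lines))
```

For the data in the question, each prefix would presumably be `str(id) + "|"`, so that ID 12 does not also match lines starting with 123. `re.escape` matters there, since "|" is a regular expression metacharacter.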