Question

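I have thousands of values (currently a list, but I could convert it to a dictionary if that helps) and want to compare them against a file with millions of lines. What I want to do is filter the file's lines down to only those that start with one of the values in the list.

What is the fastest way to do this?

My slow code: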

  for line in source_file:
    # Go through all IDs
    for id in my_ids:
      if line.startswith(str(id) + "|"):
        #replace commas with semicolons and pipes with commas
        target_file.write(line.replace(",",";").replace("|",","))

Answer 1

If you are sure that each line starts with id + "|" and that "|" never appears inside an id, you can take advantage of the "|" delimiter. For example:

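# Build a set once so each membership test is O(1) on average
# (a bare map() is a one-shot iterator in Python 3 and a list in Python 2)
my_id_strs = set(map(str, my_ids))
for line in source_file:
    # Split only at the first "|"; the rest of the line is not needed here
    first_part = line.split("|", 1)[0]
    if first_part in my_id_strs:
        # replace commas with semicolons and pipes with commas
        target_file.write(line.replace(",", ";").replace("|", ","))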

Hope this helps :)

Answer 2

Use a translation table (translate) for the replacement. You can also break out of the inner loop as soon as an id matches.

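# Python 3: str.maketrans (Python 2 used `from string import maketrans`)
# Maps "," -> ";" and "|" -> ","
trantab = str.maketrans(",|", ";,")

# Precompute the prefixes once instead of on every comparison
ids = ['%d|' % id for id in my_ids]

for line in source_file:
    # Go through all IDs
    for id in ids:
        if line.startswith(id):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.translate(trantab))
            break  # at most one ID can match, so stop looking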
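
As a variant of the inner loop above, `str.startswith` also accepts a tuple of prefixes, which moves the scan over the IDs into a single built-in call. A minimal sketch, assuming integer IDs; the function name `filter_lines_startswith`, the sample IDs, and the sample lines are made up for illustration, and no timing claim is made here:

```python
def filter_lines_startswith(lines, my_ids):
    """Yield the translated lines that start with '<id>|' for some id in my_ids."""
    # str.startswith accepts a tuple of candidate prefixes
    prefixes = tuple('%d|' % i for i in my_ids)
    # Translation table: "," -> ";" and "|" -> ","
    trantab = str.maketrans(",|", ";,")
    for line in lines:
        if line.startswith(prefixes):
            yield line.translate(trantab)


# Hypothetical sample data
sample = ["12|a,b|c\n", "120|x|y\n", "7|p,q\n"]
result = list(filter_lines_startswith(sample, [12, 7]))
print(result)
```

This still tries the prefixes one after another internally, so with thousands of IDs a set lookup on the text before the first "|" will probably scale better.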

Or

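# Python 3: str.maketrans (Python 2 used `from string import maketrans`)
# replace commas with semicolons and pipes with commas
trantab = str.maketrans(",|", ";,")
# Store the IDs as strings so they compare equal to the text sliced from each line
idset = set(map(str, my_ids))

for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        # No "|" in this line, so it cannot match; skip it
        pass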
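
To see the set-based approach run end to end, here is a self-contained sketch that uses in-memory `io.StringIO` objects in place of real files. The sample IDs and lines are invented for illustration, and `str.partition` is swapped in for `index` so that lines without a "|" need no exception handling:

```python
import io

my_ids = [12, 7]  # hypothetical IDs
source_file = io.StringIO("12|a,b|c\n120|x|y\nno pipe here\n7|p,q\n")
target_file = io.StringIO()

# Translation table: "," -> ";" and "|" -> ","
trantab = str.maketrans(",|", ";,")
# IDs as strings, so they compare equal to text taken from the line
idset = set(map(str, my_ids))

for line in source_file:
    # partition returns (head, sep, tail); sep is "" when "|" is absent
    head, sep, _ = line.partition("|")
    if sep and head in idset:
        target_file.write(line.translate(trantab))

output = target_file.getvalue()
print(output)
```

Each line costs one partition and one hash lookup, regardless of how many IDs there are.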

Answer 3

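Use a regular expression. We build and compile the regular expression first (expensive, but done only once); after that, matching is very, very fast.

Test code for an implementation, a function `filterlines(prefixes, lines)` whose definition is not reproduced here:

with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]

lines = [
    "zoo this should match",
    "000 this shouldn't match",
]

print(list(filterlines(prefixes, lines)))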
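
One way such a `filterlines` could be written is sketched below. This is an assumption about its shape, not the answerer's actual code: it escapes each prefix, joins them into one alternation, compiles that once, and yields the lines that match at the start. The two-word prefix list stands in for the dictionary file.

```python
import re


def filterlines(prefixes, lines):
    """Yield each line that starts with one of the given prefixes."""
    # Drop empty prefixes: an empty alternative would match every line
    prefixes = [p for p in prefixes if p]
    if not prefixes:
        return
    # Escape every prefix so characters such as "|" or "." match literally
    pattern = "|".join(re.escape(p) for p in prefixes)
    regex = re.compile(pattern)  # compiled once, reused for every line
    for line in lines:
        # re.match only matches at the start of the string
        if regex.match(line):
            yield line


prefixes = ["zoo", "zebra"]  # stand-in for the word list
lines = [
    "zoo this should match",
    "000 this shouldn't match",
]
matched = list(filterlines(prefixes, lines))
print(matched)
```

For the data in the question, the prefixes would be strings such as "12|", built from each ID plus the delimiter.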

Fastest way to check whether a line starts with one of the values in a list?
