Question

Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and another of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

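Say you want to extract those lines from file two whose first three elements appear together on a line in file one (order matters). Any suggestions on how this might be done quickly?

The output file from any such script, given the sample data above, would be: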

label1 label2 name1 0.1 1000

I toyed with Python:

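# Read the keys from the first file into a list of [label, label, name] lists.
inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

# Scan the second file and keep lines whose first three fields are a known key.
inp = open('file2', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()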

However, this seems to take a while. The files are rather large.

Thanks!

Answer 1

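Your Python version is rather inefficient because you're testing for membership in a list rather than in a set or a dict (i.e. O(n) lookup time instead of O(1)).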
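To make the difference concrete, here is a small timing sketch. The key names and sizes are made up for illustration, and the absolute numbers will vary by machine; the point is only that the list lookup scans every element while the set lookup hashes the key once.

```python
import timeit

# Hypothetical keys, shaped like the first three fields of each line.
keys = [("label%d" % i, "label%d" % (i + 1), "name%d" % i) for i in range(5000)]

as_list = [list(k) for k in keys]   # what the question's code builds
as_set = set(keys)                  # hashable tuples stored in a set

probe = keys[-1]                    # worst case for the list: the last element

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: list(probe) in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

Note that the keys have to be tuples (or strings) to go into a set, since lists are not hashable.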

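Try using a set of tuples or a set of strings instead. Tuples are the better choice, since the two files might be split on different delimiters, but I don't think you'll see a particularly large performance difference. tuple('something'.split()) is relatively fast compared to testing for membership in a very long list.

Also, there's no need to call inp.readlines(). In other words, you could just do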
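As a rough illustration of why tuples cope better with differing delimiters, consider the sketch below. The helper name `key_of` and the sample lines are assumptions made for this example, not part of the original files.

```python
def key_of(line, sep=None):
    """Return the first three fields of a line as a tuple.

    sep=None splits on any run of whitespace; pass e.g. "\t" or ","
    for a file that uses one specific delimiter.
    """
    return tuple(line.strip().split(sep)[:3])

# Hypothetical lines: file one is tab-separated, file two uses spaces.
file1_line = "label1\tlabel2\tname1\n"
file2_line = "label1 label2   name1 0.1 1000\n"

lookup = {key_of(file1_line, sep="\t")}

# The tuple key ignores how the fields were separated.
print(key_of(file2_line) in lookup)                      # True

# A raw string key only matches if the separators are identical.
print("label1 label2 name1" in {file1_line.strip()})     # False
```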

look_up = set(tuple(line.split()) for line in inp)

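Apart from using tuple(line[:3]) rather than [line[0], line[1], line[2]], you should see a significant speedup without having to change any other part of your code.

Actually, grep and bash are pretty much perfect for this... (Untested, but it should work.)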

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"

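To see which one is faster, we can generate some test data (~4500 keys in file1.txt and ~10000 keys in file2.txt) and benchmark a simple Python version of the same thing (roughly: the lines will be printed in a different order than with the grep version).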

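# Build a set of (label, label, name) tuples from the key file.
with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

# Stream the data file and print lines whose first three fields are a key.
with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print(line, end='')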

The Python version turns out to be about 70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

VS

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s
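For anyone who wants to reproduce a comparison like this, test data of roughly the stated size could be generated along the following lines. This is only a sketch: the function name `make_test_files`, the label scheme, and the value ranges are all assumptions, not the generator that was actually used for the timings above.

```python
import random

def make_test_files(n_keys=4500, n_data=10000, seed=0,
                    key_path="file1.txt", data_path="file2.txt"):
    """Write a key file and a data file filled with made-up labels.

    Every data line has a unique first-three-field key, and n_keys of
    those keys are copied (in random order) into the key file.
    """
    rng = random.Random(seed)
    triples = [("label%d" % rng.randrange(100),
                "label%d" % rng.randrange(100),
                "name%d" % i) for i in range(n_data)]

    # Key file: a random sample of the triples, three fields per line.
    with open(key_path, "w") as keyfile:
        for triple in rng.sample(triples, n_keys):
            keyfile.write(" ".join(triple) + "\n")

    # Data file: every triple followed by two arbitrary numeric fields.
    with open(data_path, "w") as datafile:
        for triple in triples:
            datafile.write("%s %.1f %d\n" % (" ".join(triple),
                                             rng.random(),
                                             rng.randrange(1000)))
```

Calling `make_test_files()` with the defaults writes file1.txt and file2.txt into the current directory, overwriting any existing files with those names.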

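Of course, the two examples approach the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if file1 has only a couple of key lines, the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) membership lookup? I think bash 4 might have a hash table, but I don't know anything about it... It would be interesting to try implementing an algorithm similar to the Python example above in bash as well.)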

Answer 2

Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

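It merges both files into one list, sorts it (so we get chunks of data with the same key, each led by the line from file 1), then uses a special Perl one-liner to keep only the well-formed lines that are preceded by a "header" from file 1.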
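The same merge-and-sort idea can be sketched in Python to make the logic easier to follow. This is an illustrative re-expression, not a translation of the one-liner: the function name `merge_filter` is made up, and it assumes single-space separators, exactly three fields per key line, and more than three fields per data line.

```python
def merge_filter(key_lines, data_lines):
    """Yield data lines whose first three fields match a key line.

    Assumes single-space separators, key lines with exactly three
    fields, and data lines with more than three fields.
    """
    current = None
    # A key line is a prefix of its matching data lines, so after
    # sorting it sorts ahead of them and acts as their "header".
    for line in sorted(list(key_lines) + list(data_lines)):
        fields = line.split()
        if len(fields) == 3:
            current = fields        # new header taken from the key file
        elif fields[:3] == current:
            yield line              # data line belonging to that header

keys = ["label1 label2 name1", "label1 label3 name2"]
data = ["label1 label2 name1 0.1 1000", "label9 label6 name7 0.8 0.5"]
print(list(merge_filter(keys, data)))   # ['label1 label2 name1 0.1 1000']
```

Sorting costs O(n log n) over the combined input, so for data that fits in memory a hash-based lookup is usually the simpler choice; the sorting approach is mainly attractive when an external sort can handle files too large to hold in memory.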

Answer 3

You could try using the string "label1 label2 name1" as the key, rather than the triplet of values.

Answer 4

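I would use a hash to store the values from the first file. It is not very error-resilient (it expects exactly one space between items), but you'll get the general idea...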

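#!/usr/bin/env python3

# Store each stripped line of the first file as a dictionary key.
labels={}
with open('log') as fd:
    for line in fd:
        line=line.strip()
        labels[line]=True

# Print lines of the second file whose first three fields,
# joined by single spaces, are one of the stored keys.
with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print(line, end='')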

Processing large files - suggestions for Python or the command line?