Question

Hi all, I have these data files.

File1:

1   The hero
2   Chainsaw and the gang
3   .........
4   .........

where the first field is the id and the second field is the product name.

File2:

The hero 12
The hero 2
Chainsaw and the gang 2
.......................

From these two files I want to produce a third file.

File3:

The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................

As you can see, I am just appending the index read from File1.

I used this approach:

awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File3

I build an associative array from File1 and look it up using the product name from File2.

However, my files are huge: I have 20 million product names, and this process takes a long time. Any suggestions on how I can speed it up?

Answer 1

You can use this awk:

awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
     ($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2

Answer 2

The script you posted cannot produce the desired output from the input files you posted, so let's fix that first:

$ cat file1
1   The hero
2   Chainsaw and the gang

$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2

$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2

Now, is that really too slow, or are you doing some pre- or post-processing that is the real speed problem?

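The obvious speed-up: if your "file2" is sorted, you can delete the corresponding map[] value whenever the key changes, so map[] gets smaller each time you use it. For example, something like this (untested):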

$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2

An alternative approach, for when populating map[] takes too much time/memory and file2 is sorted:

$ awk '
{   key=$0
    sub(/[[:space:]]+[^[:space:]]+$/,"",key)
    if (key != prev) {
        cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
        val = ""    # reset, so an unmatched key does not reuse the previous id
        cmd | getline val
        close(cmd)
    }
    print $0, val
    prev = key
}' file2

Answer 3

Judging by the comments, your lookup has a scaling problem. The general fix is to merge sorted sequences:

join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
    <( sort -t $'\t' -k2 file1) \
    <( sort -t $'\t' -sk1,1 file2)

I don't think Windows can do process substitution, so you would have to use temporary files:

sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey

join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey

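If you need to preserve the original order of the value lookups, use nl to put line numbers at the front first, then sort on those line numbers at the end.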
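
The nl idea could be sketched roughly as follows. This is only an illustration under assumptions: both inputs are taken to be tab-separated (id, then name in the lookup file; name, then value in the values file), and add_ids_keep_order is a hypothetical helper name, not something from the answer above.

```shell
# Sketch only. Assumes both inputs are tab-separated:
#   lookup file: id<TAB>name      values file: name<TAB>value
# add_ids_keep_order is a hypothetical helper name used for illustration.
add_ids_keep_order() {
    lookup=$1 values=$2
    tab=$(printf '\t')
    tmp=$(mktemp -d) || return 1

    # Sort the lookup table by name (field 2).
    LC_ALL=C sort -t "$tab" -k2,2 "$lookup" > "$tmp/idlookup.bykey"

    # Prefix every values line with its line number, then sort by name,
    # which is now field 2.
    nl -ba -s "$tab" "$values" |
        LC_ALL=C sort -t "$tab" -k2,2 > "$tmp/values.bykey"

    # Join on the name and emit: line number, name, value, id.
    # Then restore the original order by line number and drop that column.
    LC_ALL=C join -t "$tab" -1 2 -2 2 -o 2.1,2.2,2.3,1.1 \
        "$tmp/idlookup.bykey" "$tmp/values.bykey" |
        sort -t "$tab" -k1,1n |
        cut -f2-

    rm -rf "$tmp"
}
```

Calling add_ids_keep_order file1 file2 > file3 would then write name, value and id in file2's original line order. As with the plain join, lines of file2 whose name has no match in file1 are dropped.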

Answer 4

If your problem is performance, try this perl script:

#!/usr/bin/perl -l 

use strict;
use warnings;

my %h;
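open my $fh1, "<", "file1.txt" or die "Cannot open file1.txt: $!";
open my $fh2, "<", "file2.txt" or die "Cannot open file2.txt: $!";
open my $fh3, ">", "file3.txt" or die "Cannot open file3.txt: $!";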

while (<$fh1>) {
    my ($v, $k) = /(\d+)\s+(.*)/;
    $h{$k} = $v;
}

while (<$fh2>) {
    my ($k, $v) = /(.*)\s+(\d+)$/;
    print $fh3 "$k $v $h{$k}" if exists $h{$k};
}

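Save the script above as script.pl and run it as perl script.pl. Make sure file1.txt and file2.txt are in the directory you run it from, since the script opens them by relative path.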

Problem mapping an index using awk
