Question

文件1：

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198

和file2：

NR_024540 11

我需要在file2中找到匹配file1并打印整个file1 + second column of file2

所以ouptut是：

  chr1  14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11

我的解决方案在bash中非常缓慢：

#!/bin/bash

while read line; do

c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output


 done < file2

我更喜欢更快的任何bash或awk解决方案。输出可以修改，但需要保留所有信息（列的顺序可以不同）。

编辑：

现在，根据@chepner，它看起来像是最快的解决方案：

#!/bin/bash

while read -r c d; do

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' 

done < file2 > output

Answer 1

在单个Awk命令中

awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

多线程中更易阅读的版本

FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array map
    for (i in map){
        # If there is a direct regex match on the line with the 
        # element from the hash-map, print it and append the 
        # hash-mapped value at last
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}

Answer 2

试试这个 -

 cat file2
NR_024540 11
NR_024541 12

 cat file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14361   14829   NR_024542_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15


awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11

表现 - （经过测试55000条记录）

time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1

real    0m0.16s
user    0m0.14s
sys     0m0.01s

Answer 3

使用join和sed的另一种解决方案，假设file1和file2已排序

join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output

如果输出顺序无关紧要，请使用awk

awk 'FNR==NR{d[$1]=$2; next}
    {split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1

你明白了，

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

Answer 4

您正在不必要地启动许多外部程序。让read为您分配来自file2的收到行，而不是两次调用awk。也没有必要运行grep; awk可以自行进行过滤。

while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output

Answer 5

如果搜索到的字符串长度始终相同（length("NR_024540")==9）：

awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1

说明：

NR==FNR {                         # process file2
    a[$1]=$2                      # hash record using $1 as the key
    next                          # skip to next record
} 
(i=substr($4,1,9)) && (i in a) {  # read the first 9 bytes of $4 to i and search in a
    print $0, a[i]                # output if found
}

Answer 6

awk -F '[[:blank:]_]+' '
   FNR==NR { a[$2]=$3 ;next }
   { if ( $5 in a ) $0 = $0 " " a[$5] }
   7
   ' file2 file1

注释：

使用_作为额外字段分隔符，因此文件名更容易在两个文件中进行比较（仅使用数字部分）。
7是为了好玩，它只是一个非0值 - ＆gt;打印行
我没有改变字段（NF + 1，...）所以我们保留原始格式只添加引用的数字

较小的oneliner代码（针对代码大小进行了优化）（假设file1中的非空行是必需的）。如果separator只是空格，你可以用空格字符

重新设置[：blank：]

awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1

Answer 7

不需要var agents = db.AllAgentLocations().AsEnumerable() .Where(al => al.PrimaryOffice) .Select(al => new AgentDistanceViewModel { Agent = al.Agent, Distance = searchCoords.GetDistanceTo( new GeoCoordinate { Latitude = double.Parse(al.Location.Latitude), Longitude = double.Parse(al.Location.Longitude) }) / 1609.34 }) .Where(a => a.Distance < 25);或awk。假设 file2 只有一行：

sed

更快的解决方案来比较bash中的文件

7 个答案: