A faster solution for comparing files in bash

Asked: 2017-02-28 12:50:32

Tags: linux bash awk sed

file1:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198

and file2:

NR_024540 11

I need to find the lines of file1 that match an entry in file2 and print the whole file1 line plus the second column of file2.

So the output is:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11

My solution in bash is very slow:

#!/bin/bash

while read line; do

c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output


 done < file2

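I would prefer any faster bash or awk solution. The output may be modified, but it needs to keep all the information (the order of the columns can be different).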

Edit:

For now, based on @chepner's answer, this looks like the fastest solution:

#!/bin/bash

while read -r c d; do

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' 

done < file2 > output

7 answers:

Answer 0 (score: 5)

In a single Awk command:

awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

A more readable multi-line version:

FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array map
    for (i in map){
        # If there is a direct regex match on the line with the 
        # element from the hash-map, print it and append the 
        # hash-mapped value at last
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}
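
Note that `$0 ~ i` treats every key from file2 as a regular expression and tests it against the whole line, so the cost grows with the number of keys. A minimal sketch of a literal alternative follows. It assumes the ID always starts the fourth column and is followed by "_", which matches the sample data but is not stated in the question. The temp-file names are arbitrary.

```shell
#!/bin/bash
# Sketch: literal prefix match with index() instead of a regex match.
# Sample rows are copied from the question; the temp directory is arbitrary.
tmp=$(mktemp -d)
printf 'chr1\t14361\t14829\tNR_024540_0_r_DDX11L1,WASH7P_468\n' > "$tmp/file1"
printf 'chr1\t14969\t15038\tNR_024540_1_r_WASH7P_69\n' >> "$tmp/file1"
printf 'NR_024540 11\n' > "$tmp/file2"

literal_out=$(awk '
    FNR == NR { map[$1] = $2; next }          # first file (file2): ID -> value
    {
        for (id in map)
            if (index($4, id "_") == 1) {     # $4 starts with the literal ID plus "_"
                print $0, map[id]             # $0 is not rebuilt, so its tabs survive
                next
            }
    }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$literal_out"
rm -rf "$tmp"
```

This still loops over all keys for every line; the hash-lookup answers below avoid that loop when the key can be derived directly from the fourth column.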

Answer 1 (score: 2)

Try this -

 cat file2
NR_024540 11
NR_024541 12

 cat file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14361   14829   NR_024542_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15


awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
  

Performance (tested with 55000 records)

time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1

real    0m0.16s
user    0m0.14s
sys     0m0.01s

Answer 2 (score: 2)

Another solution using join and sed, assuming file1 and file2 are sorted:

join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output

If the output order does not matter, use awk:

awk 'FNR==NR{d[$1]=$2; next}
    {split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1 

You get:

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
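
The key derivation in the awk version can be checked in isolation. The sketch below only shows what split() produces for one value of column 4; the sample value is taken from file1 in the question, and the variable names are illustrative.

```shell
#!/bin/bash
# Sketch: show the lookup key that split() derives from column 4.
# The sample value is taken from file1 in the question.
name='NR_024540_0_r_DDX11L1,WASH7P_468'
key=$(printf '%s\n' "$name" | awk '{ split($1, v, "_"); print v[1] "_" v[2] }')
echo "$key"    # the first two "_"-separated pieces, rejoined with "_"
```

Because the key is rebuilt from the first two pieces, this does not depend on the ID having a fixed length.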

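Answer 3 (score: 1)

You are needlessly starting many external programs. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.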

while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output

Answer 4 (score: 1)

If the searched string is always the same length (length("NR_024540")==9):

awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1

Explanation:

NR==FNR {                         # process file2
    a[$1]=$2                      # hash record using $1 as the key
    next                          # skip to next record
} 
(i=substr($4,1,9)) && (i in a) {  # read the first 9 bytes of $4 to i and search in a
    print $0, a[i]                # output if found
}
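
If the IDs were not all nine characters long (a case this answer explicitly excludes), the prefix could be taken with match() instead of a fixed substr(). This is a sketch under that assumption, not part of the original answer; the NM_1 rows are made-up sample data.

```shell
#!/bin/bash
# Sketch: derive the key with match() so IDs need not be exactly 9 characters.
# The NM_1 rows are made-up sample data for illustration only.
tmp=$(mktemp -d)
printf 'NR_024540 11\nNM_1 7\n' > "$tmp/file2"
printf 'chr1\t14969\t15038\tNR_024540_1_r_WASH7P_69\n' > "$tmp/file1"
printf 'chr2\t100\t200\tNM_1_0_r_EXAMPLE_100\n' >> "$tmp/file1"

match_out=$(awk '
    NR == FNR { a[$1] = $2; next }                 # file2: ID -> value
    match($4, /^[^_]+_[^_]+/) {                    # prefix up to the second "_"
        id = substr($4, RSTART, RLENGTH)
        if (id in a) print $0, a[id]               # one hash lookup per line
    }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$match_out"
rm -rf "$tmp"
```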

Answer 5 (score: 0)

awk -F '[[:blank:]_]+' '
   FNR==NR { a[$2]=$3 ;next }
   { if ( $5 in a ) $0 = $0 " " a[$5] }
   7
   ' file2 file1

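Notes:

  • Uses _ as an extra field separator, so the names are easier to compare in both files (only the numeric part is used).
  • The 7 is for fun; it is just a non-zero value -> print the line
  • I do not change a field (NF+1, ...), so the original format is kept and only the referenced number is appended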
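
To see why the script compares $5 of file1 with $2 of file2, it can help to print the fields awk produces with this separator. A small sketch using one row from the question; it assumes an awk that supports POSIX character classes such as [:blank:] (gawk and recent mawk do).

```shell
#!/bin/bash
# Sketch: print what awk sees in $4 and $5 with -F '[[:blank:]_]+'.
line=$(printf 'chr1\t14969\t15038\tNR_024540_1_r_WASH7P_69')
fields=$(printf '%s\n' "$line" | awk -F '[[:blank:]_]+' '{ print NF ": $4=" $4 ", $5=" $5 }')
echo "$fields"    # the numeric part of the ID ends up in $5
```

With the same separator, the file2 line "NR_024540 11" splits into NR, 024540 and 11, which is why the first block stores a[$2]=$3.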

A smaller one-liner (optimized for code size), assuming non-empty lines in file1 are mandatory. If the separator is only a space, you can replace [:blank:] with a space character:

awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1

Answer 6 (score: 0)

No need for awk. Assuming file2 has only one line:

sed