Question

I have two files with the same number of tab-separated columns. They look like this:

File A:

12345    Fish    Apple    7123  
321      Chicken Apple    9912  
661      Ant     Apple    316

File B:

321      Duck    Orange    9912   
12345    Bird    Orange    7123    
661      Eagle   Orange    34

Expected output:

File A_edited:

661    Ant    Apple    316

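Based on the IDs in columns 1 and 4 of file B, I want to remove a line from file A if both values appear in columns 1 and 4 of that line. I tried doing this with grep, but both files are very large, about 66 GB each, so it has been running for a day and still has not finished. Is there a faster way to do this than grep?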

P.S. There are actually more than four columns; only four are shown here for simplicity.

awk '{print $1 "\t" $4}' B.txt >> B_edited.txt


# Extract the line number in A.txt containing lines where two IDs are present in B_edited.txt
cat B_edited.txt|while read ID1 ID2
do 
    grep -nE "$ID1.*$ID2"  A.txt|cut -d: -f1 >> LineNumber.txt
done

# Remove duplicates of line numbers 
sort -u LineNumber.txt >> LineNumberUnique.txt

# Output only lines from A.txt where line numbers are not in the list
awk 'FNR == NR { h[$1]; next } !(FNR in h)' LineNumberUnique.txt A.txt >> A_edited.txt

I would really appreciate any help!

Thanks,
Ren

Answer 1

$ awk '{k=$1FS$4} NR==FNR{keys[k];next} !(k in keys)' fileB fileA
661      Ant     Apple    316
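
For anyone wondering why this is so much faster than the grep loop, here is a sketch of the same one-liner spelled out with comments. It reads fileB once to collect the column 1 / column 4 pairs in an array, then reads fileA once and prints only the rows whose pair was not collected, so each file is scanned a single time instead of once per ID pair. The sample files are recreated from the question, and the -F'\t' option is an addition that is not in the answer above: it assumes the data really is tab-separated and that some fields may contain spaces.

```shell
# Recreate the sample files from the question (tab-separated).
printf '12345\tFish\tApple\t7123\n321\tChicken\tApple\t9912\n661\tAnt\tApple\t316\n' > fileA
printf '321\tDuck\tOrange\t9912\n12345\tBird\tOrange\t7123\n661\tEagle\tOrange\t34\n' > fileB

# Same logic as the one-liner, written out.
# -F'\t' is an assumption added here: split on tabs only.
awk -F'\t' '
    { key = $1 FS $4 }              # build the lookup key from columns 1 and 4
    NR == FNR { seen[key]; next }   # first file (fileB): remember the key, print nothing
    !(key in seen)                  # second file (fileA): print rows whose key is not in fileB
' fileB fileA
# prints: 661  Ant  Apple  316  (tab-separated)
```

Only the keys from fileB are held in memory, not the full lines, but with an input of this size the array can still be large.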

To overwrite fileA with the output, just add > tmp && mv tmp fileA to the end of the command, or use -i inplace if you have GNU awk 4.*.
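
As a minimal sketch of the temporary-file variant, run against the sample data from the question (the printf lines only recreate that data):

```shell
# Recreate the sample files from the question (tab-separated).
printf '12345\tFish\tApple\t7123\n321\tChicken\tApple\t9912\n661\tAnt\tApple\t316\n' > fileA
printf '321\tDuck\tOrange\t9912\n12345\tBird\tOrange\t7123\n661\tEagle\tOrange\t34\n' > fileB

# Write the filtered rows to tmp; replace fileA only if awk exited successfully.
awk '{k=$1FS$4} NR==FNR{keys[k];next} !(k in keys)' fileB fileA > tmp && mv tmp fileA

cat fileA
# prints: 661  Ant  Apple  316  (tab-separated)
```

Because of the &&, fileA is left untouched if awk fails partway through. The -i inplace route is not shown here since it needs GNU awk; as far as I know it edits every file operand in place, so fileB would have to be excluded from in-place editing. Check the gawk manual for your version before relying on it.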

Extract lines in file A based on two columns in file B
