How to pull out all lines of a file that match each line of another file, and output each set of matches on its own line?

Asked: 2015-11-05 21:40:39

Tags: awk grep

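This is similar to a question I asked earlier (see the link below), but this time I want the matching strings output in rows rather than columns, as follows:

I have two files, each with a single column that looks like this:

File 1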

chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560

File 2

chr1 1066234
chr1 106994
chr1 1106115

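I want to search file 1 and pull out all lines that match line 1 of file 2 (that is, lines that begin with it, as in the example below), and output all of those matches together on a line of their own. Then I want to do the same for line 2 of file 2, and so on, until the matches in file 1 for every line of file 2 have been found and each set has been output on its own line. Also, I am working with very large files, so I need something that does not have to hold file 2 entirely in memory, otherwise it will not run to completion. Hopefully the output would look like this: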

chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426  chr1 10699448 
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560  

Similar question: How to move all strings in one file that match the lines of another to columns in an output file?

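3 Answers:

Answer 0 (score: 3)

As long as your patterns do not overlap (no line in file1 matches more than one pattern), this should work fine: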

$ while IFS= read -r p; do grep "^$p" file1 | tr '\n' '\t'; echo; done < file2
chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560
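
The loop above treats every line of file2 as a regular expression. As a hedged variation (not part of the original answer), the sketch below does a literal prefix test with awk's `index()` instead, so characters such as `.` or `*` in a key are not interpreted as regex operators. The function name `match_rows` is hypothetical, and the sketch assumes the keys contain no backslashes, because `awk -v` processes escape sequences in assigned values.

```shell
# Hypothetical helper (not from the original answer).
# Usage: match_rows file1 file2
# Prints one tab-separated row per line of file2, containing every line of
# file1 that starts with that file2 line. A key with no match yields an
# empty row, like the grep loop above.
# ASSUMPTION: keys contain no backslashes (awk -v would interpret them).
match_rows() {
    while IFS= read -r p; do
        awk -v p="$p" '
            index($0, p) == 1 {          # literal test: line starts with p
                printf "%s%s", sep, $0   # no separator before the first match
                sep = "\t"
            }
            END { print "" }             # terminate the row
        ' "$1"
    done < "$2"
}
```

Like the original loop, this reads file1 once per line of file2, so it keeps memory use low at the cost of speed.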

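Answer 1 (score: 1)

You could do it like this. It uses close to zero memory, but it will be very slow because it reads the whole of "file1" once for each line of "file2":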

$ cat tst.awk
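{
    ofs = ors = ""                            # nothing to print until a match is found
    while ( (getline line < "file1") > 0 ) {  # re-read file1 for this line of file2
        if (line ~ "^"$0) {                   # line starts with the current key (used as a regex)
            printf "%s%s", ofs, line
            ofs = "\t"                        # tab before every match after the first
            ors = "\n"                        # end the row only if something matched
        }
    }
    printf "%s", ors
    close("file1")                            # so the next getline starts from the top
}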

$ awk -f tst.awk file2
chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560
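
If the slowness is a problem, a single pass over both files may be possible. The sketch below is a different technique from the answer above (a merge-style scan), and it only works under assumptions the question does not confirm: both files are sorted in the same byte order (for example with `LC_ALL=C sort`), and no key in file2 is a prefix of another key. The sample data happens to satisfy both. The names `merge_rows`, `emit_row` and `next_key` are hypothetical.

```shell
# Hypothetical helper (not from the original answer).
# Usage: merge_rows file1 file2
# ASSUMPTIONS: file1 and file2 are both sorted in C (byte) order, and no
# line of file2 is a prefix of another line of file2.
merge_rows() {
    LC_ALL=C awk -v keyfile="$2" '
        function emit_row() {        # print the row collected for the current key
            if (row != "") print row
            row = ""
        }
        function next_key() {        # read one more key; have_key is 0 at end of keyfile
            have_key = ((getline key < keyfile) > 0)
        }
        BEGIN { next_key() }
        {
            line = $0 ""             # force a string (not numeric) comparison
            # The key sorts before this line and is not its prefix, so no
            # later line can match it: close its row and move to the next key.
            while (have_key && index(line, key) != 1 && line > key) {
                emit_row()
                next_key()
            }
            if (!have_key) exit      # keys exhausted; END still runs
            if (index(line, key) == 1)
                row = (row == "" ? line : row "\t" line)
        }
        END { emit_row() }
    ' "$1"
}
```

Only one key and one output row are held in memory at a time, and each file is read once. Keys with no match produce no output line, as in tst.awk above.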

Answer 2 (score: 0)

You can try:

awk -v OFS='\t' '
NR==FNR {                     # first file on the command line (file2) only
    keys[++i] = $0            # keys[] stores the patterns to search for; i counts them
    next                      # stop processing the current record and
                              # go on to the next record
}
{
    for (j = 1; j <= i; ++j)
        # if the line starts with this key, append it to that key's row
        if ($0 ~ "^" keys[j])
            a[keys[j]] = a[keys[j]] (a[keys[j]] != "" ? OFS : "") $0
}
END {
    for (j = 1; j <= i; ++j) print a[keys[j]]   # print one row per key, in file2 order
}' file2 file1

You get:

chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560