How do I get the lines that are common to at least two files?

Asked: 2015-06-04 08:06:49

Tags: bash text text-files

I have seven test files. Each of them looks like this:

File1

chr     start   end     strand
chr1    10525   10525   +
chr1    10542   10542   +
chr1    10571   10571   +
chr1    10577   10577   +
chr2    10589   10589   +
chr2    565262  565262  +
chr2    565397  565397  +
chr3    567239  567239  +
chr3    567312  567312  +
chr4    567348  567348  +

How can I get the lines that occur in at least two of the files, in the following format?

chr     start   end     strand  File1   File2   File3   File4   File5   File6   File7
chr1    10525   10525   +   0   1   0   0   0   1   1
chr1    10542   10542   +   1   1   1   1   1   0   0
chr1    10571   10571   +   0   1   0   1   1   0   0
chr3    10577   10577   +   1   1   0   0   0   1   0
chr3    10589   10589   +   0   0   1   0   1   0   1
chr4    565262  565262  +   1   0   0   1   1   1   1

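"1" means the line is present in the given file, and "0" means the line is absent from it. I don't want to show lines that are not shared between files, i.e. lines that occur in only one file.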

1 Answer:

Answer 0: (score: 0)

Using awk:

awk '
    FNR==1{ # Header line of each file:
        fn[++i]=FILENAME; # record the file name
        fn[0]=$0; # and the column header
    }

    (FNR>1){ # For lines other than header lines
        if (!(($0,FILENAME) in file_list)) list[$0]++; # Count each file at most once per line
        file_list[$0,FILENAME]++; # Record that this file has the line
    }

    END{
        printf "%s", fn[0]; # Print the column header,
        for(t=1;t<=i;t++) printf "\t%s", fn[t]; # then the file names
        print ""; # End the header row.
        for(t in list){ # For every line that occurred in any of the files
            if (list[t]>=2){ # If it occurs in at least two different files
                printf "%s", t; # Print line
                for(j=1;j<=i;j++) {
                    printf "\t%d", ((t,fn[j]) in file_list); # 1 if file j has the line, else 0
                }
                print "" # Print newline.
            }
        }
    }' File{1..7}
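
To try the idea on small inputs, here is a minimal sketch, not the answerer's exact script. It wraps a similar awk program in a shell function and sorts the data rows, because awk does not guarantee any particular order for `for (t in list)`. The function name `common_lines`, the three sample files, and their contents are assumptions made up for the demonstration. It also assumes every file starts with the same header line.

```shell
#!/usr/bin/env bash
# common_lines: hypothetical helper, not part of the original answer.
# Prints the shared header plus one 0/1 column per file, for every
# data line that occurs in at least two of the given files.
common_lines() {
    awk '
        FNR == 1 { fn[++n] = FILENAME; header = $0; next }  # remember file order and header
        !(($0, FILENAME) in seen) {                          # first time this file shows the line
            seen[$0, FILENAME] = 1
            count[$0]++                                      # number of distinct files with the line
        }
        END {
            printf "%s", header
            for (j = 1; j <= n; j++) printf "\t%s", fn[j]
            print ""
            for (line in count) {
                if (count[line] < 2) continue                # skip lines found in only one file
                printf "%s", line
                for (j = 1; j <= n; j++) printf "\t%d", ((line, fn[j]) in seen)
                print ""
            }
        }' "$@" | {
            # Pass the header row through untouched, then sort the data rows.
            if IFS= read -r first; then printf '%s\n' "$first"; fi
            LC_ALL=C sort
        }
}

# Demo with three small made-up files in a temporary directory.
tmp=$(mktemp -d)
printf 'chr\tstart\tend\tstrand\nchr1\t10525\t10525\t+\nchr1\t10542\t10542\t+\n' > "$tmp/File1"
printf 'chr\tstart\tend\tstrand\nchr1\t10542\t10542\t+\nchr2\t10589\t10589\t+\n' > "$tmp/File2"
printf 'chr\tstart\tend\tstrand\nchr1\t10525\t10525\t+\nchr1\t10542\t10542\t+\nchr1\t10542\t10542\t+\nchr3\t567239\t567239\t+\n' > "$tmp/File3"
(cd "$tmp" && common_lines File1 File2 File3)
rm -rf "$tmp"
```

In this demo the line that appears twice in File3 is still counted as one file, and the rows for chr2 and chr3 are dropped because each occurs in a single file only. The sort here is plain byte order on the whole row; a numeric sort on the start column may suit real coordinate data better.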