I have seven test files. They look like this:
File1
chr start end strand
chr1 10525 10525 +
chr1 10542 10542 +
chr1 10571 10571 +
chr1 10577 10577 +
chr2 10589 10589 +
chr2 565262 565262 +
chr2 565397 565397 +
chr3 567239 567239 +
chr3 567312 567312 +
chr4 567348 567348 +
How can I get the lines that are common to at least two of the files, in the following format?
chr start end strand File1 File2 File3 File4 File5 File6 File7
chr1 10525 10525 + 0 1 0 0 0 1 1
chr1 10542 10542 + 1 1 1 1 1 0 0
chr1 10571 10571 + 0 1 0 1 1 0 0
chr3 10577 10577 + 1 1 0 0 0 1 0
chr3 10589 10589 + 0 0 1 0 1 0 1
chr4 565262 565262 + 1 0 0 1 1 1 1
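"1" means the line is present in the given file, and "0" means the line is absent from it. I don't want to show lines that are not shared by at least two files.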
Answer 0 (score: 0)
Using awk:
awk '
FNR==1{                                   # Header line of each file:
    fn[++i]=FILENAME;                     # record the file name
    fn[0]=$0;                             # and the column header
}
(FNR>1){                                  # For lines other than header lines
    if (!file_list[$0,FILENAME]++)        # First time this line is seen in this file:
        list[$0]++;                       # count the file once, so duplicates within one file do not inflate the total
}
END{
    printf "%s", fn[0];                   # Print the column header ...
    for(t=1;t<=i;t++) printf "\t%s", fn[t];   # ... followed by the file names
    print "";
    for(t in list){                       # For every distinct line (iteration order is unspecified)
        if (list[t]>=2){                  # Keep lines present in at least two files
            printf "%s", t;               # Print line
            for(j=1;j<=i;j++)
                printf "\t%d", (file_list[t,fn[j]] ? 1 : 0);   # 1 if present in file j, else 0
            print "";                     # Print newline.
        }
    }
}' File{1..7}
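
One caveat: `for (t in list)` visits the lines in an unspecified order, so the rows can come out shuffled. Below is a minimal sketch of one way to get stable output, assuming space-separated columns as in the question. The function name `common_lines` is hypothetical (it is not part of the answer above). The header is passed through untouched and only the data rows are sorted. The chromosome column is sorted as plain text, so `chr10` would sort before `chr2`.

```shell
# Hypothetical helper: presence/absence matrix for lines found in >= 2 files,
# with data rows sorted by chromosome name (text) and start position (numeric).
common_lines() {
    awk '
    FNR == 1 { fn[++n] = FILENAME; header = $0; next }  # first line of each file is the header
    !seen[$0, FILENAME]++ { files[$0]++ }               # count each file at most once per line
    END {
        printf "%s", header
        for (j = 1; j <= n; j++) printf " %s", fn[j]
        print ""
        for (line in files) {
            if (files[line] < 2) continue               # keep lines shared by at least two files
            row = line
            for (j = 1; j <= n; j++)
                row = row " " (((line, fn[j]) in seen) ? 1 : 0)
            print row
        }
    }' "$@" | {
        IFS= read -r header_line                        # pass the header through unsorted
        printf '%s\n' "$header_line"
        sort -k1,1 -k2,2n                               # sort the remaining rows
    }
}
```

It would be called the same way as the script above, for example `common_lines File{1..7}` in bash.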