I have the following two files, BC.txt (the first two lines below) and PB.txt (the four lines after them):
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
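I am trying to compare column 1 of BC.txt with column 12 of PB.txt and print the matching lines next to each other. The same value in column 1 of BC.txt can appear with different values in columns 2 and 3, so when I run the comparison I get output for only one of the BC.txt entries, but I want all of them. This is the command I am using, followed by the output I expect: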
awk 'BEGIN {OFS=FS} NR==FNR {a[$1]=($2" "$3);next} $12 in a {print $0,a[$12]}' BC.txt PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
I want to compare all entries of BC.txt and PB.txt, but because the key values are identical my code does not work.
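Answer 0: (score: 1)
If you don't care about the order of the output lines compared with the expected output in the question, then reading BC.txt into memory is all you need: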
$ cat tst.awk
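NR==FNR {
    # First file (BC.txt): keep every value seen for a key, numbered 1..cnt[key]
    map[$1,++cnt[$1]] = $2 OFS $3
    next
}
{
    # Second file (PB.txt): print the line once per stored BC.txt value
    for (c=1; c<=cnt[$12]; c++) {
        print $0, map[$12,c]
    }
}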
$ awk -f tst.awk BC.txt PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
But if you do care about the order, read PB.txt into memory instead:
$ cat tst.awk
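NR==FNR {
    # First file (PB.txt): keep every line seen for a key, numbered 1..cnt[key]
    map[$12,++cnt[$12]] = $0
    next
}
{
    # Second file (BC.txt): print each stored PB.txt line with this UMI and BC
    for (c=1; c<=cnt[$1]; c++) {
        print map[$1,c], $2, $3
    }
}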
$ awk -f tst.awk PB.txt BC.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
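As a side note on why the one-liner in the question prints only one BC.txt entry per match: `a[$1]=...` is assigned once per BC.txt line, so a repeated key keeps only the last value. The small sketch below shows this with made-up two-line input (the key `k` and the values `v1` and `v2` are placeholders, not data from the question), next to the counter-indexed form that both scripts above rely on.

```shell
# Plain a[key]: the second line overwrites the first, so only "v2" survives.
overwritten=$(printf '%s\n' 'k v1' 'k v2' |
  awk '{ a[$1] = $2 } END { print a["k"] }')

# a[key, n] with a per-key counter: every value is kept, in input order.
kept=$(printf '%s\n' 'k v1' 'k v2' |
  awk '{ a[$1, ++cnt[$1]] = $2 }
       END { for (c = 1; c <= cnt["k"]; c++) print a["k", c] }')

printf 'overwritten: %s\n' "$overwritten"
printf 'kept:\n%s\n' "$kept"
```

The per-key counter costs one extra array, but it means no separator has to be chosen for packing several values into a single string.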
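Answer 1: (score: 1)
Could you use join to do this? (It works as-is if the join columns are already sorted; otherwise wrap the inputs in <() and sort them first.)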
$ join BC.txt <(awk '{print $12,$0}' PB.txt) | cut -d' ' -f 4-
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
and then cut out the columns you want from the joined output?
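One possible way to keep the UMI and BC columns, rather than cutting them away, is to move them to the end of each joined line with a second awk. This is only a sketch under a few assumptions: both inputs are already sorted on the join key (true for the sample, which has a single key), the files are shortened copies of the question's data written to a temporary directory, and the variable names `tmp`, `joined`, `umi` and `bc` are mine. It reads the keyed PB.txt from standard input via `-` instead of `<()` so that it does not depend on bash.

```shell
tmp=$(mktemp -d)

# Shortened copies of the sample data from the question.
cat > "$tmp/BC.txt" <<'EOF'
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
EOF
cat > "$tmp/PB.txt" <<'EOF'
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
EOF

# Prefix each PB.txt line with its key, join on it, then move the
# UMI and BC fields (fields 2 and 3 of the joined line) to the end.
joined=$(awk '{ print $12, $0 }' "$tmp/PB.txt" |
  join "$tmp/BC.txt" - |
  awk '{ umi = $2; bc = $3
         $1 = $2 = $3 = ""        # blank the key, UMI and BC fields
         sub(/^ +/, "")           # drop the leading spaces left behind
         print $0, umi, bc }')

printf '%s\n' "$joined"
rm -r "$tmp"
```

With two BC.txt lines and two PB.txt lines sharing one key, this prints four lines, each a PB.txt line followed by one UMI/BC pair.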
Answer 2: (score: 0)
Could you please try the following (tested only with the provided samples).
awk '
FNR==NR{
  a[++count]=$0
  b[count]=$12
  next
}
{
  for(i=1;i<=count;i++){
    split(a[i],array," ")
    if($1==array[12]){
      print a[i],$2,$3
    }
  }
}' PB.txt BC.txt
Explanation: the same code as above, with a comment on each step.
awk '                        ##Start the awk program.
FNR==NR{                     ##True only while the first file, PB.txt, is being read.
  a[++count]=$0              ##Increment count and store the current line in array a at that index.
  b[count]=$12               ##Store the 12th column in array b at the same index.
  next                       ##Skip the remaining statements for this line.
}
{
  for(i=1;i<=count;i++){     ##Loop over every stored PB.txt line, from 1 to count.
    split(a[i],array," ")    ##Split a[i] on whitespace into the array named array.
    if($1==array[12]){       ##If column 1 of the BC.txt line equals the 12th field of the stored line:
      print a[i],$2,$3       ##print the stored line followed by columns 2 and 3 of the BC.txt line.
    }
  }
}' PB.txt BC.txt             ##Input file names.
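The code above fills array `b` with the 12th column but then re-splits each stored line to get that column back. A slightly leaner variant, offered only as a sketch, compares against `b[i]` directly. The temporary directory, the shortened sample files and the variable names `tmp` and `result` are my own additions for making it runnable on its own.

```shell
tmp=$(mktemp -d)

# Shortened copies of the sample data from the question.
cat > "$tmp/BC.txt" <<'EOF'
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
EOF
cat > "$tmp/PB.txt" <<'EOF'
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
EOF

result=$(awk '
FNR==NR {                 # first file: PB.txt
  a[++count] = $0         # whole line
  b[count]   = $12        # its key, stored once
  next
}
{                         # second file: BC.txt
  for (i = 1; i <= count; i++)
    if ($1 == b[i])       # compare keys without re-splitting a[i]
      print a[i], $2, $3
}' "$tmp/PB.txt" "$tmp/BC.txt")

printf '%s\n' "$result"
rm -r "$tmp"
```

Like the original, this scans every stored PB.txt line for each BC.txt line, so it stays simple at the price of a linear search per lookup.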