I have a file with 16 columns (tab-separated values):
22 51169729 G 39 A 0 0 C 0 0 G 38 0.974359 T 1 0.025641
22 51169730 A 36 A 36 1 C 0 0 G 0 0 T 0 0
22 51169731 C 39 A 0 0 C 39 1 G 0 0 T 0 0
22 51169732 G 37 A 0 0 C 0 0 G 37 1 T 0 0
22 51169733 G 33 A 0 0 C 0 0 G 33 1 T 0 0
22 51169734 C 35 A 0 0 C 35 1 G 0 0 T 0 0
22 51169735 A 32 A 32 1 C 0 0 G 0 0 T 0 0
22 51169736 G 32 A 0 0 C 0 0 G 32 1 T 0 0
22 51169737 C 30 A 0 0 C 30 1 G 0 0 T 0 0
22 51169738 T 27 A 0 0 C 0 0 G 0 0 T 27 1
22 51169739 G 26 A 0 0 C 0 0 G 26 1 T 0 0
22 51169740 A 25 A 25 1 C 0 0 G 0 0 T 0 0
22 51169741 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169742 G 23 A 0 0 C 0 0 G 23 1 T 0 0
22 51169743 C 21 A 0 0 C 21 1 G 0 0 T 0 0
22 51169744 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169745 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169746 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169747 A 15 A 14 0.933333 C 1 0.0666667 G 0 0 T 0 0
22 51169748 C 20 A 0 0 C 20 1 G 0 0 T 0 0
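The third column can be A, G, C, or T.
I would like to:
After doing this for the whole file, some rows are left with only 4 columns and others with 7, as in the following example: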
22 51169729 G 39 T 1 0.025641
22 51169730 A 36
22 51169731 C 39
22 51169732 G 37
22 51169733 G 33
22 51169734 C 35
22 51169735 A 32
22 51169736 G 32
22 51169737 C 30
22 51169738 T 27
22 51169739 G 26
22 51169740 A 25
22 51169741 C 22
22 51169742 G 23
22 51169743 C 21
22 51169744 C 22
22 51169745 C 19
22 51169746 C 19
22 51169747 A 15 C 2 0.133333
22 51169748 C 20
Any suggestions?
Answer 0 (score: 1)
A Perl solution for the first part:
#!/usr/bin/perl
use warnings;
use strict;
my %remove = ( A => 4,   # Zero-based index at which to start removing
               C => 7,   # columns for a given character in column #3.
               G => 10,
               T => 13,
             );
$\ = "\n";    # Add newline to prints.
$, = "\t";    # Separate values by tabs.
while (<>) {                         # Read input line by line.
    chomp;                           # Remove newline.
    my @F = split /\t/;              # Split on tabs, populate an array.
    splice @F, $remove{ $F[2] }, 3;  # Remove the three columns.
    print @F;                        # Output.
}
Once the second requirement is clarified, I can try to add more code. Which values do you want to remove? Can you show more examples?
Answer 1 (score: 0)
Here is one way to do the first part, assuming there are no empty fields:
$ cat tst.awk
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ gsub(/[[:space:]]+/,"\t"); print }
$ awk -f tst.awk file
1 957584 C 157 A 1 0.006 G 0 0 T 0 0
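The blank-and-squeeze approach above depends on there being no empty fields. A minimal sketch that avoids that assumption rebuilds each line from the fields to keep. It assumes strictly tab-separated input in the 16-column layout from the question. The function name `drop_ref_columns` and the `start` table are illustrative only and are not part of the original answer.

```shell
# Hypothetical helper: drop the three columns that belong to the base in column 3.
drop_ref_columns() {
    awk -F'\t' -v OFS='\t' '
        BEGIN { start["A"] = 5; start["C"] = 8; start["G"] = 11; start["T"] = 14 }
        {
            # s is the first field to drop; 0 means an unknown base, so keep everything.
            s = ($3 in start) ? start[$3] : 0
            out = ""
            for (i = 1; i <= NF; i++) {
                if (s && i >= s && i <= s + 2) continue   # skip the reference-base triple
                out = (i == 1) ? $i : out OFS $i
            }
            print out
        }
    ' "$@"
}

# First row of the question: the G triple (fields 11-13) is removed.
printf '22\t51169729\tG\t39\tA\t0\t0\tC\t0\t0\tG\t38\t0.974359\tT\t1\t0.025641\n' | drop_ref_columns
# prints (tab-separated): 22 51169729 G 39 A 0 0 C 0 0 T 1 0.025641
```

Because fields are never blanked, an empty value in a kept column survives as an empty tab-delimited field instead of being collapsed.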
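I don't really understand what you are trying to do in the second part, but it sounds like this might be what you want, if the tests on $7/$10/$13 refer to the field numbers as they are after the first stage has modified the line: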
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ $0=$0 }
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); print }
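The `{ $0=$0 }` rule is what makes the later tests see the renumbered fields: blanking a field leaves NF unchanged, and assigning the record to itself forces awk to split it again. A small illustration (the helper name `count_fields` is made up for this sketch):

```shell
# Hypothetical demo: report NF before and after re-splitting the record.
count_fields() {
    awk '{ $2 = ""; before = NF; $0 = $0; print before, NF }'
}

printf 'a\tb\tc\td\n' | count_fields   # prints "4 3"
```

After `$2 = ""` the record is rebuilt as `a  c d` with an empty second field, so NF is still 4. Re-splitting on the default whitespace separator collapses the gap, and `c` becomes `$2`.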
Or, if the tests on $7/$10/$13 refer to the original field numbers:
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); print }
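A third reading is suggested by the expected output in the question, where rows end up with either 4 or 7 columns: after the reference base is dropped, every remaining base whose count is zero is dropped too. This is only a guess at the intended rule. The sketch below assumes tab-separated input and the 16-column layout, and the function name `keep_observed_alts` is hypothetical.

```shell
# Hypothetical helper: keep columns 1-4, then only non-reference bases with a non-zero count.
keep_observed_alts() {
    awk -F'\t' -v OFS='\t' '
        {
            out = $1 OFS $2 OFS $3 OFS $4
            for (i = 5; i + 2 <= NF; i += 3) {
                if ($i == $3) continue            # reference base: always dropped
                if ($(i + 1) + 0 == 0) continue   # assumed rule: drop bases that were never observed
                out = out OFS $i OFS $(i + 1) OFS $(i + 2)
            }
            print out
        }
    ' "$@"
}

printf '22\t51169729\tG\t39\tA\t0\t0\tC\t0\t0\tG\t38\t0.974359\tT\t1\t0.025641\n' | keep_observed_alts
# prints (tab-separated): 22 51169729 G 39 T 1 0.025641
```

This reproduces the column structure of the expected output. It passes the input values through unchanged, so it will not reproduce the `C 2 0.133333` shown for position 51169747, where the question's expected output differs from its input (`C 1 0.0666667`).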
If not, please edit your question to clarify it with better examples.