I am trying to process some data, but I cannot find a working solution to my problem. I have a file that looks like this:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many lines more....
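I want to filter out every sequence, together with its header (headers start with >), whose sequence string (the lines not starting with >) consists of 30% or more lowercase letters. A sequence string can span multiple lines.
So after command xy the output should look like this: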
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried a few while loops that read the input file and then call awk, grep and sed, but with no good result.
Answer 0 (score: 4)
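Here is one idea: setting the record separator to ">" makes awk treat each header together with its sequence lines as a single record.
Because the input starts with ">", this produces an initial empty record, so we guard the computation with NR > 1
(record number greater than 1).
To count the total number of characters, we add up the lengths of all lines after the header. To count the lowercase characters, we copy each line into another variable and use gsub to replace every lowercase letter with nothing. gsub returns the number of substitutions it made, which makes it a convenient way to count them.
Finally, we check the ratio and either print the record or skip it (adding back the leading ">" when we print).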
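BEGIN { RS = ">" }               # each header plus its sequence lines is one record
NR > 1 {                         # skip the empty record before the first ">"
    total_cnt = 0
    lower_cnt = 0
    for (i = 2; i <= NF; ++i) {  # $1 is the header (assumed to contain no spaces)
        total_cnt += length($i)
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)  # gsub returns the number of replacements
    }
    # Keep the record only if it has sequence data and is under 30% lowercase.
    # $0 already ends with a newline, so printf avoids printing an extra blank line.
    if (total_cnt > 0 && lower_cnt / total_cnt < 0.3) printf ">%s", $0
}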
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
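For anyone who wants to cross-check the awk logic in another language, here is a minimal Python sketch of the same record-splitting idea. It is not part of the original answer: the function name filter_records, the threshold parameter and the sample data are made up for illustration. re.subn stands in for gsub because it also reports how many substitutions it made.

```python
import re

def filter_records(text, threshold=0.3):
    """Keep records whose lowercase fraction is below `threshold` (hypothetical helper)."""
    kept = []
    # Splitting on ">" mirrors RS=">"; the first piece is the empty text
    # before the first header, so it is skipped (the NR > 1 guard).
    for record in text.split(">")[1:]:
        header, _, body = record.partition("\n")
        seq = body.replace("\n", "")
        if not seq:
            continue  # no sequence data, nothing to measure
        # re.subn returns (new_string, number_of_substitutions).
        _, lower_cnt = re.subn(r"[a-z]", "", seq)
        if lower_cnt / len(seq) < threshold:
            kept.append(">" + record)
    return "".join(kept)

# Made-up sample: ram is all lowercase, pam has 2 of 12, sam has 4 of 8.
sample = ">ram\ncacaca\ncaca\n>pam\nGAATGTCAAAct\n>sam\nAATTggcc\n"
print(filter_records(sample), end="")  # prints only the pam record
```

Like the awk version, this assumes that ">" never appears inside a header or a sequence line.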
Answer 1 (score: 2)
Alternatively, with GNU awk (gensub and RT are gawk extensions):
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
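RS='>[a-z]+\n'
- sets the record separator to a whole header line: ">" followed by a lowercase name and a newline
RT
- is set by gawk to the text that RS actually matched
a=RT
- saves the current RT, which is the header belonging to the next record, so it is available when that record is processed
n=length(gensub(/[A-Z]/,"","g"));
- deletes the uppercase letters and takes the length of what is left, i.e. the lowercase letters (embedded newlines are counted too)
if(NF && n/length*100 < 30)print a $0;
- checks that the record is not empty and that the lowercase percentage is below 30, then prints the saved header followed by the record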
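The RS/RT pairing can be hard to picture, so here is a rough Python sketch of the same mechanism, under the assumption that headers look like ">name" with a lowercase name. A capturing group makes re.split return the matched separators as well, which is roughly what RT provides. The function name and sample data are invented for this example, and unlike the one-liner it strips newlines before computing the percentage.

```python
import re

def filter_with_separators(text, max_percent=30.0):
    """Pair each record with the header that preceded it (hypothetical helper)."""
    # With a capturing group, re.split keeps the separators:
    # parts = [text_before_first_header, header1, body1, header2, body2, ...]
    parts = re.split(r"(>[a-z]+\n)", text)
    out = []
    for header, body in zip(parts[1::2], parts[2::2]):
        seq = body.replace("\n", "")
        if not seq:
            continue  # same role as the NF test in the awk one-liner
        # What remains after deleting uppercase letters; this equals the
        # lowercase count only if the sequence contains nothing but letters.
        lower = len(re.sub(r"[A-Z]", "", seq))
        if lower / len(seq) * 100 < max_percent:
            out.append(header + body)
    return "".join(out)

sample = ">ram\ncacaca\ncaca\n>pam\nGAATGTCAAAct\n>sam\nAATTggcc\n"
print(filter_with_separators(sample), end="")  # prints only the pam record
```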
Answer 2 (score: 1)
awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
H=$0;B="";next}
{B=( (B != "") ? B "\n" : "" ) $0}
END{ b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
}' YourFile
Quick and dirty; a function would suit the check-and-print step better, since it is duplicated in the END block.
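To illustrate what factoring the duplicated check into a function could look like, here is a small Python sketch of the same line-by-line approach. The names emit and filter_lines and the sample lines are hypothetical, chosen only for this example.

```python
def emit(header, body, threshold=0.3):
    """Return the record's lines if it passes the check, else an empty list."""
    seq = "".join(body)
    lower = sum(1 for ch in seq if ch.islower())
    if header is None or not seq or lower >= len(seq) * threshold:
        return []
    return [header] + body

def filter_lines(lines):
    out, header, body = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            out += emit(header, body)  # flush the previous record
            header, body = line, []
        else:
            body.append(line)
    out += emit(header, body)          # flush the last record (the END block)
    return out

sample_lines = [">ram", "cacaca", "caca", ">pam", "GAATGTCAAAct", ">sam", "AATTggcc"]
print("\n".join(filter_lines(sample_lines)))  # prints the pam header and its sequence
```

Both flush points call the same helper, so the threshold only has to be changed in one place.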
Answer 3 (score: 1)
Nowadays I would no longer use sed or awk for anything longer than 2 lines.
#! /usr/bin/perl
use strict;   # Force variable declaration.
use warnings; # Warn about dangerous language use.

sub filter # Declare a subroutine, a function called `filter`.
{
    my ($header, $body) = @_;      # Give the first two function arguments the names header and body.
    return unless length $body;    # Skip records without sequence data (avoids division by zero).
    my $lower = $body =~ tr/a-z//; # tr with an empty replacement list only counts the characters a-z.
    print $header, $body, "\n"     # Print header, body (joined on one line) and newline,
        unless $lower / length($body) >= 0.3; # unless 30% or more of the characters are lowercase.
}

my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or from files given on the command line.
    if (/^>/) { # If the line starts with >,
        filter($header, $body) # call filter with header and body,
            if defined $header; # if header is defined, which is not the case at the beginning of the file.
        ($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
    } else {
        chomp; # Remove the newline at the end of the line.
        $body .= $_; # Append the line to body.
    }
}
filter($header, $body) if defined $header; # Filter the last record, if there was one.
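The question asks to drop sequences with 30% or more lowercase letters, so the case of exactly 30% matters. The following Python sketch shows one way to make that boundary explicit by comparing integers instead of a floating-point ratio. The function name too_many_lowercase is made up, and treating an empty sequence as "not filtered" is an assumption of this example.

```python
def too_many_lowercase(seq, percent=30):
    """True if at least `percent` % of the characters are lowercase a-z."""
    if not seq:
        return False  # assumption: an empty sequence is not filtered out
    lower = sum(1 for ch in seq if "a" <= ch <= "z")
    # lower / len(seq) >= percent / 100, rearranged to stay in integers
    return lower * 100 >= percent * len(seq)

print(too_many_lowercase("AAAAAAAccc"))  # 3 of 10 lowercase, exactly 30% -> True
print(too_many_lowercase("AAAAAAAAcc"))  # 2 of 10 lowercase -> False
```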