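Remove lines that contain more than 30% lowercase letters

Time: 2017-02-21 14:49:38

Tags: bash awk sed

I am trying to process some data, but I cannot find a working solution for my problem. I have a file that looks like this: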

>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg

and many lines more....

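I want to filter out every sequence and its header (headers start with >) where the sequence string (the lines that do not start with >) contains 30% or more lowercase letters. A sequence string can span multiple lines.

So after running command xy, the output should look like this: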

>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct

I tried some while loops to read the input file and then used awk, grep, and sed, but got no good result.

4 answers:

Answer 0: (score: 4)

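Here is one idea: set the record separator to ">" so that each header and its sequence lines are treated as a single record.

Because the input starts with ">", which produces an initial empty record, we guard the computation with NR > 1 (record number greater than 1).

To count the total number of characters, we add up the lengths of all lines after the header. To count the lowercase characters, we copy each line into another variable and use gsub to replace every lowercase letter with nothing. gsub returns the number of substitutions it made, which makes it a convenient way to count them.

Finally, we check the ratio and decide whether to print (adding back the initial ">" when we do).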

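BEGIN { RS = ">" }                          # a header plus its sequence lines form one record

NR > 1 {                                    # skip the empty record before the first ">"
    total_cnt = 0
    lower_cnt = 0
    for (i=2; i<=NF; ++i) {                 # $1 is the header, the remaining fields are sequence lines
        total_cnt += length($i)
        s = $i                              # work on a copy so that $0 stays intact
        lower_cnt += gsub(/[a-z]/, "", s)   # gsub returns the number of replacements
    }
    if (total_cnt == 0) next                # header without a sequence: avoid division by zero
    ratio = lower_cnt / total_cnt
    if (ratio < 0.3) printf ">%s", $0       # $0 normally ends with a newline already
}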


$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
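
To see why a given record is kept or dropped, it can help to print the counts instead of filtering. The following is only a sketch that reuses the counting logic above; the function name ratio_report is made up for illustration, and it assumes a header is a single word with no spaces.

```shell
# ratio_report FILE
# Prints "header lowercase_count total_count ratio" for every record in FILE.
# Hypothetical helper: it assumes headers contain no whitespace.
ratio_report() {
    awk 'BEGIN { RS = ">" }
    NR > 1 && NF > 1 {                      # skip the leading empty record and bare headers
        total = 0; lower = 0
        for (i = 2; i <= NF; ++i) {         # fields 2..NF are the sequence lines
            total += length($i)
            s = $i
            lower += gsub(/[a-z]/, "", s)   # number of lowercase letters on this line
        }
        printf "%s %d %d %.2f\n", $1, lower, total, lower / total
    }' "$1"
}
```

Calling ratio_report seq.txt prints one line per record; records whose last column is 0.30 or higher are the ones the filter removes.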

Answer 1: (score: 2)

Alternatively, with GNU awk:

awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
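  1. RS='>[a-z]+\n' - sets the record separator to a whole header line, i.e. ">" followed by a name

  2. RT - this variable is set to the text that RS matched

  3. a=RT - saves the RT value so that the next record can be printed with its own header

  4. n=length(gensub(/[A-Z]/,"","g")); - deletes the uppercase letters and takes the length of what is left, i.e. the number of lowercase letters (line breaks inside the record are counted as well)

  5. if(NF && n/length*100 < 30)print a $0; - checks that the record is not empty and that the percentage of lowercase letters is below 30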
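
gensub is a gawk extension. Where only a POSIX awk is available, the counting step from item 4 can be done with gsub on a copy of the string. This is a minimal sketch; count_lower is a hypothetical helper name, and it prints the count obtained both ways so that they can be compared.

```shell
# count_lower STRING
# Prints two numbers: the lowercase count taken from the return value of gsub,
# and the length of what remains after deleting the uppercase letters.
# Hypothetical helper; STRING should not contain backslashes or newlines.
count_lower() {
    awk -v s="$1" 'BEGIN {
        t = s
        by_gsub = gsub(/[a-z]/, "", t)   # gsub returns the number of replacements
        u = s
        gsub(/[A-Z]/, "", u)             # delete uppercase letters, keep everything else
        by_len = length(u)
        print by_gsub, by_len
    }'
}
```

The two numbers agree only when the string consists purely of letters; anything else that is left over, such as a line break, inflates the second one.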

Answer 2: (score: 1)

awk '/^>/{b=B;gsub( /[A-Z]/,"",b);   # b = B without its uppercase letters
          if( length( b) < length( B) * 0.3) print H "\n" B
          H=$0;B="";next}

     {B=( (B != "") ? B "\n" : "" ) $0}

     END{ b=B;gsub( /[A-Z]/,"",b);
          if( length( b) < length( B) * 0.3) print H "\n" B
        }' YourFile
Quick and dirty; a function would suit the print step better.
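
As the remark above suggests, the duplicated check can be moved into an awk function. What follows is a sketch of that refactoring, not the answer's original code: the names filter_fasta and emit are made up, line breaks are left out of the count, and records with 30% or more lowercase letters are dropped as the question asks.

```shell
# filter_fasta FILE
# Prints every record of FILE whose sequence has fewer than 30% lowercase letters.
# Hypothetical wrapper around a POSIX awk program.
filter_fasta() {
    awk '
    # Print the pending record (header H, body B) if it passes the check.
    function emit(    seq, total, n_lower) {
        if (H == "") return                  # nothing collected yet
        seq = B
        gsub(/\n/, "", seq)                  # leave line breaks out of the count
        total = length(seq)
        if (total == 0) return               # header without a sequence
        n_lower = gsub(/[a-z]/, "", seq)     # gsub returns the number of replacements
        if (n_lower / total < 0.3) print H "\n" B
    }
    /^>/    { emit(); H = $0; B = ""; next } # new header: flush the previous record
    NF == 0 { next }                         # ignore blank lines
            { B = (B != "" ? B "\n" : "") $0 }
    END     { emit() }                       # flush the last record
    ' "$1"
}
```

Because the check lives in one place, changing the threshold or the character class only needs a single edit.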

Answer 3: (score: 1)

Nowadays I would no longer use sed or awk for anything longer than 2 lines.

#! /usr/bin/perl
use strict;                                # Force variable declaration.
use warnings;                              # Warn about dangerous language use.

sub filter                                 # Declare a subroutine, a function called `filter`.
{
  my ($header, $body) = @_;                # Give the first two function arguments the names header and body.
  return unless length ($body);            # Skip records without a sequence (avoids division by zero).
  my $lower = $body =~ tr/a-z//;           # Count the characters a-z (tr returns the number of matches).
  print $header, $body, "\n"               # Print header, body and newline,
    unless $lower / length ($body) >= 0.3; # unless lowercase characters make up 30% or more.
}

my ($header, $body);                       # Declare two variables for header and body.
while (<>) {                               # Loop over all lines from stdin or a file given in the command line.
  if (/^>/) {                              # If the line starts with >,
    filter ($header, $body)                # call filter with header and body,
      if defined $header;                  # if header is defined, which is not the case at the beginning of the file.
    ($header, $body) = ($_, '');           # Assign the current line to header and an empty string to body.
  } else {
    chomp;                                 # Remove the newline at the end of the line.
    $body .= $_;                           # Append the line to body.
  }
}
filter ($header, $body)                    # Filter the last record,
  if defined $header;                      # if the input contained at least one header.