Question

I am trying to process some data, but I cannot find a working solution to my problem. I have a file that looks like this:

>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg

and many lines more....

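I want to filter out all sequences, together with their headers (a header starts with >), in which the sequence string (the lines not starting with >) contains 30% or more lowercase letters. A sequence string can span multiple lines.

So after running some command xy, the output should look like this: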

>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct

I tried some while loops that read the input file and then used awk, grep, and sed, but without good results.

Answer 1

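Here is one idea: set the record separator to ">" so that each header and its sequence lines are treated as a single record.

Because the input starts with a ">", this produces an initial empty record, so we guard the computation with NR > 1 (record number greater than 1).

To count the total number of characters, we add up the lengths of all the lines after the header. To count the lowercase characters, we save the string in another variable and use gsub to replace all lowercase letters with nothing; gsub returns the number of substitutions it made, which makes it a convenient way to count them.

Finally, we check the ratio and print or not (adding back the initial ">" when we print).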

BEGIN { RS = ">" }

NR > 1 {
    total_cnt = 0
    lower_cnt = 0
    for (i=2; i<=NF; ++i) {
        total_cnt += length($i)
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)
    }
    ratio = lower_cnt / total_cnt
    if (ratio < 0.3) print ">"$0
}


$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
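
The counting trick is easier to see in isolation. The following is a minimal sketch, not part of the script above: the shell function name count_lower and the sample string are made up for illustration, and LC_ALL=C is set on the assumption that [a-z] should match ASCII lowercase letters only.

```shell
# count_lower is a hypothetical helper used only for this illustration.
count_lower() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        s = $0                        # work on a copy so $0 stays intact
        n = gsub(/[a-z]/, "", s)      # gsub returns the number of replacements
        printf "%d/%d\n", n, length($0)
    }'
}

count_lower ACGTacgtAC    # prints 4/10
```

The copy matters because gsub modifies its target in place; counting on $i directly would destroy the sequence before it can be printed.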

Answer 2

Alternatively:

awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file

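RS='>[a-z]+\n' - sets the record separator to the whole header line: ">" followed by the name
RT - set to the text that RS matched (RT and gensub are GNU awk extensions)
a=RT - saves the RT value so that it is available as the header when the next record is processed
n=length(gensub(/[A-Z]/,"","g")); - deletes the uppercase letters and takes the length of what remains, i.e. the lowercase letters (plus any line breaks in the record)
if(NF && n/length*100 < 30)print a $0; - checks that the record is not empty and that the percentage of lowercase letters is below 30, then prints the saved header followed by the record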

Answer 3

awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
          if( length( b) < length( B) * 0.3) print H "\n" B
          H=$0;B="";next}

     {B=( (B != "") ? B "\n" : "" ) $0}

     END{ b=B;gsub( /[A-Z]/,"",b);
          if( length( b) < length( B) * 0.3) print H "\n" B
        }' YourFile

Quick and dirty; a function would handle the printing better.
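
One way the function-based variant might look is sketched below. It is an illustration under assumptions rather than the answerer's code: the names filter_fasta and emit are hypothetical, line breaks are excluded from the length so they do not skew the ratio, blank lines are skipped, and records without any sequence are dropped.

```shell
# filter_fasta and emit are hypothetical names chosen for this sketch.
filter_fasta() {
    LC_ALL=C awk '
        # Print the pending record (header H, body B) if its share of
        # lowercase letters is below 30%. Extra parameters act as locals.
        function emit(    seq, total, lower) {
            if (H == "") return
            seq = B
            gsub(/\n/, "", seq)               # do not count line breaks
            total = length(seq)
            lower = gsub(/[a-z]/, "", seq)    # number of lowercase letters
            if (total > 0 && lower / total < 0.3) print H "\n" B
        }
        /^>/ { emit(); H = $0; B = ""; next }        # new header: flush the old record
        NF   { B = (B != "" ? B "\n" : "") $0 }      # collect non-blank sequence lines
        END  { emit() }                              # flush the last record
    ' "$@"
}
```

It reads the files given as arguments, or standard input when there are none, for example `filter_fasta YourFile`.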

Answer 4

Nowadays I would not use sed or awk for anything longer than 2 lines.

#! /usr/bin/perl
use strict;                                # Force variable declaration.
use warnings;                              # Warn about dangerous language use.

sub filter                                 # Declare a subroutine, a function called `filter`.
{
  my ($header, $body) = @_;                # Give the first two function arguments the names header and body.
  my $lower = $body =~ tr/a-z//;           # Count the translation of the characters a-z to nothing.
  print $header, $body, "\n"               # Print header, body and newline,
    unless $lower / length ($body) >= 0.3; # unless lowercase characters make up 30% or more.
}

my ($header, $body);                       # Declare two variables for header and body.
while (<>) {                               # Loop over all lines from stdin or a file given in the command line.
  if (/^>/) {                              # If the line starts with >,
    filter ($header, $body)                # call filter with header and body,
      if defined $header;                  # if header is defined, which is not the case at the beginning of the file.
    ($header, $body) = ($_, '');           # Assign the current line to header and an empty string to body.
  } else {
    chomp;                                 # Remove the newline at the end of the line.
    $body .= $_;                           # Append the line to body.
  }
}
filter ($header, $body);                   # Filter the last record.

Remove lines with more than 30% lowercase letters