Searching for a pattern across multiple lines with grep

Time: 2014-06-20 10:57:54

Tags: linux grep

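I work in bioinformatics and need to count occurrences of patterns such as GATTACCA in large files, where a match can be split across two lines like this:
"ATTTCCCGATCCGAG GATT (\n)
ACCA CGTAGATGATACACGT (etc.)"
Is there a way to make grep ignore the \n newline character? Thanks for your help!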

4 answers:

Answer 0: (score: 1)

I think this may do what you want:

tr -d '\n' < file | grep -o GATTACCA

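It strips the newlines from the data on the fly (tr's -d option deletes characters; the file itself is not modified) and pipes the result to grep.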
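
To turn the matches into a count, the same pipeline can be followed by wc -l. A minimal sketch, assuming the file holds only sequence lines (no FASTA headers); the helper name count_pattern and the sample data are made up for illustration:

```shell
# count_pattern PATTERN FILE  (hypothetical helper name)
# Deletes every newline, then counts non-overlapping matches of PATTERN.
count_pattern() {
    tr -d '\n' < "$2" | grep -o -- "$1" | wc -l
}

# Sample input: the first GATTACCA is split across two lines.
sample=$(mktemp)
printf 'ATTTCCCGATCCGAGGATT\nACCACGTAGATTACCAGT\n' > "$sample"

count_pattern GATTACCA "$sample"
rm -f "$sample"
```

On this sample the pipeline finds two matches, one of which spans the line break. Note that grep -o reports non-overlapping matches only.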

Answer 1: (score: 1)

You can count the occurrences of the string GATTACCA in a file with awk and grep:

awk -v RS="\0" '{gsub (/\n/,""); print}' file | grep -o 'GATTACCA' | wc -l

Explanation:

RS="\0"            # Turns the input file into a single record.

gsub (/\n/,"")     # Removes all \n characters.

grep -o 'GATTACCA' # Takes the awk output and prints every match of GATTACCA on its own line.

wc -l              # Counts the number of lines, i.e. the number of matches.
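
The same count can also be obtained inside awk alone, which avoids relying on RS="\0" (its handling may differ between awk implementations). A sketch, assuming the whole sequence fits in memory; count_in_awk is an illustrative name, not something from the answer above:

```shell
# count_in_awk PATTERN FILE  (hypothetical helper name)
# Concatenates all lines into one string, then lets gsub() do the counting:
# gsub() returns the number of substitutions it made, and "&" puts each
# match back unchanged.
count_in_awk() {
    awk -v pat="$1" '
        { seq = seq $0 }
        END { print gsub(pat, "&", seq) }
    ' "$2"
}

# Usage: count_in_awk GATTACCA file
```

Like grep -o, gsub() counts non-overlapping matches.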

Answer 2: (score: 1)

Using sed and grep:

sed -n 'H;x;s/\n//g;/GATTACCA/p' input | grep -o GATTACCA
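
The one-liner above joins each line only with the line before it, so a match lying entirely within one line can be printed twice (once with its predecessor, once with its successor). For comparison, a commonly used sed idiom reads the whole file into the pattern space before deleting the newlines. A sketch, assuming the file fits in memory; join_lines is an illustrative name:

```shell
# join_lines FILE  (hypothetical helper name)
#   :a        define label a
#   $!N       unless on the last line, append the next line to the pattern space
#   $!ba      unless on the last line, branch back to label a
#   s/\n//g   once everything is loaded, delete the embedded newlines
join_lines() {
    sed -e ':a' -e '$!N' -e '$!ba' -e 's/\n//g' "$1"
}

# Usage: join_lines input | grep -o GATTACCA | wc -l
```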

Answer 3: (score: 0)

You already have two good general answers. Another approach is to use perl:

perl -pe 's/\n//' file | grep -o GATTACCA

However, if you are working with fasta files, this may be of interest:

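#! /bin/sh
# FastaToTbl: prints each record as "ID <TAB> sequence" on a single line
gawk '{
    if (substr($1,1,1)==">") {
        # Header line: end the previous record (if any) and print the ID
        if (NR>1)
            printf "\n%s\t", substr($0,2,length($0)-1)
        else
            printf "%s\t", substr($0,2,length($0)-1)
    } else {
        # Sequence line: append it without a newline
        printf "%s", $0
    }
}END{printf "\n"}'  "$@"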

The script above converts fasta format to tbl (sequence ID, a tab, then the sequence, all on one line). I often use it for grepping:

FastaToTbl foo.fa | grep GATTACA 
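
Because each record sits on a single line in the tbl format, matches can also be counted per sequence rather than just listed. A sketch, assuming the tab-separated ID/sequence layout described above; count_per_record is an illustrative name and not part of the original scripts:

```shell
# count_per_record PATTERN  (hypothetical helper name)
# Reads tbl lines (ID <TAB> sequence) on stdin and prints ID <TAB> match count.
# gsub() returns the number of non-overlapping matches; "&" leaves the text unchanged.
count_per_record() {
    awk -F '\t' -v pat="$1" '{ n = gsub(pat, "&", $2); print $1 "\t" n }'
}

# Usage: FastaToTbl foo.fa | count_per_record GATTACA
```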

I also have a TblToFasta script to restore the original file:

#! /bin/sh
# tbl-to-fasta.awk transforms a tbl file into a fasta file, 60 columns per record
# usage=gawk -f tbl-to-fasta TBL_FILE 


gawk '{
  sequence=$NF

  ls = length(sequence)
  is = 1
  fld  = 1

  while (fld < NF)
  {
    if (fld == 1) { printf ">" }
    printf "%s ", $fld

    if (fld == NF-1)
    {
      printf "\n"
    }
    fld = fld+1
  }

  while (is <= ls)
  {
    printf "%s\n", substr(sequence,is,60)
    is=is+60
  }
}' "$@"