How do I merge lines that start with a specific substring?

Time: 2014-04-11 06:00:42

Tags: unix awk printf substr

I have a file like this:

$ head test
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
                     gene=ENSECAG00000019088
                     note="cytochrome P450 2E1  [Source:RefSeq
                     peptide;Acc:NP_001104773]"
                     gene=ENSECAG00000004229

I would like it to look like this:

ENSECAG00000012421    synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803    Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]

I'm not sure the note is always split over two lines, so I'd like something along the lines of:

awk '{if(substr($1,1,4)=="gene") gene=$1; else print gene,$1}'

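But I would like it to recognize that the note may span two lines and that it contains spaces. So I would like it to print everything between the double quotes as column 2 (ideally with the two columns separated by \t so that they don't get mixed up later). I know how to get rid of gene=, note= and ", but I'm not sure whether they would help with the recognition. I'm happy for this to be a series of separate commands, first putting the whole note on one line and then combining it with the gene, or to do everything in one go, whichever works best.

Also, if you use awk, could you briefly explain what you are doing?

Thanks for your help!

5 answers:

Answer 0: (score: 2)

If you have GNU awk or mawk (the solution relies on a regex-based input record separator, which strictly POSIX-compliant or older awk implementations do not support):

Short version: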

awk -v RS=' *(gene=|note="|")' '
  { gsub("\n", ""); if ($0 == "") next; $1=$1; 
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n") }
  ' file

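Annotated version:

-v RS=' *(gene=|note="|")' - RS is the special variable that defines the input record separator; here it is set to a regular expression that breaks the input into the records of interest, across lines.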

awk -v RS=' *(gene=|note="|")' '
  {
    gsub("\n", "");      # remove all newlines from record
    if ($0 == "") next;  # ignore empty records
    $1=$1;               # rebuild record to compress multiple interior spaces
    # Output:
    #  - Is it a gene record, i.e. is there only 1 field that contains a gene name?
    #    Output it with just a trailing \t, but no trailing \n, so that the next
    #    note record will print on the same line.
    #  - Otherwise: a note record: print with trailing \n, effectively
    #    appending it to the previous gene record.
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n")
  }
  ' file

Answer 1: (score: 1)

Probably overly complicated, but here is one way:

/^\s*gene=/  { gene=substr($1, 6) }
/^\s*note=/  { note=substr($0, 28) }
/"$/         { if (substr($1,1,4)=="note")
                 print gene, substr($0, 28, length($0)-28);
               else
                 print gene, note, substr($0, 22, length($0)-22) }

Note that this handles both one-line and two-line notes.
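
The script above depends on the exact indentation of the input (the offsets 22 and 28). As a rough sketch only, and not part of the original answer, the same idea can be written without fixed offsets and without the \s escape, using only sub(), gsub() and substr(). The file name sample.txt and the function name join_notes are made up for illustration, and a gene with no note is assumed to be skipped:

```shell
# Recreate part of the sample input from the question (hypothetical file name).
cat > sample.txt <<'EOF'
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
EOF

# join_notes is a hypothetical helper name, not something from the answers.
join_notes() {
  awk '
    { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }  # trim indentation and trailing blanks
    /^gene=/   { gene = substr($0, 6); next }   # keep the ID that follows gene=
    /^note="/  { note = substr($0, 7) }         # first note line: text after note="
    !/^note="/ { note = note " " $0 }           # continuation line: append to the note
    /"$/       {                                # closing quote: the note is complete
                 sub(/"$/, "", note)            # drop the closing quote
                 gsub(/[ \t]+/, " ", note)      # squeeze runs of blanks
                 print gene "\t" note           # gene ID, tab, full note
               }
  ' "$@"
}

join_notes sample.txt
```

Because a note is only printed once a line ending in " has been seen, this should cope with notes of one, two, or more lines, as long as the closing quote is the last character of the note.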

Answer 2: (score: 0)

Using awk:

awk 'BEGIN{FS="\n";RS="gene="}{gsub(/(note=|\")/,"");print $1,$2,$3}' file|awk '$1=$1'

ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229

Answer 3: (score: 0)

 sed -n 'N;;/"$/!N;s/\n//g;p' input | \
   sed 's/.*gene=//;s/[ \t]*note="\([^"]*\)"/\t\1 /;s/  */ /g'

Gives:

ENSECAG00000012421  synaptonemal complex central element protein 1 [Source:HGN...
ENSECAG00000017803  Uncharacterized protein [Source:Uni...
ENSECAG00000019088  cytochrome P450 2E1 [Source:Ref...

Answer 4: (score: 0)

$ awk '{$1=$1; gsub(/"/,""); sub(/^note=/,""); pfx=(sub(/^gene=/,"")?(NR>1?ORS:""):OFS); printf "%s%s",pfx,$0} END{print ""}' file
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229
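
Since the question asks for an explanation, here is the same one-liner spelled out with comments. This is only a reformatted sketch of the command above, with the conditional expression written as an if/else; the file name sample.txt and the function name merge_lines are made up for illustration:

```shell
# Recreate the last few lines of the sample input (hypothetical file name).
cat > sample.txt <<'EOF'
                     gene=ENSECAG00000019088
                     note="cytochrome P450 2E1  [Source:RefSeq
                     peptide;Acc:NP_001104773]"
                     gene=ENSECAG00000004229
EOF

# merge_lines is a hypothetical helper name wrapping the expanded one-liner.
merge_lines() {
  awk '
    {
      $1 = $1                        # rebuild the record: trims the indentation and
                                     # squeezes runs of blanks into a single OFS
      gsub(/"/, "")                  # remove the double quotes
      sub(/^note=/, "")              # remove the note= label, if present
      if (sub(/^gene=/, ""))         # sub() returns 1 when it removed gene=
        pfx = (NR > 1 ? ORS : "")    # gene line: start a new output line,
                                     # except before the very first one
      else
        pfx = OFS                    # note text: stay on the current output line
      printf "%s%s", pfx, $0         # no newline here; the next gene line adds it
    }
    END { print "" }                 # terminate the last output line
  ' "$@"
}

merge_lines sample.txt
```

The separator between the gene ID and the note is OFS, a single space by default, which matches the output shown above. A gene without a note is still printed, on a line by itself.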