Question

I have a file that looks like this:

$ head test
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
                     gene=ENSECAG00000019088
                     note="cytochrome P450 2E1  [Source:RefSeq
                     peptide;Acc:NP_001104773]"
                     gene=ENSECAG00000004229

and I would like it to look like this:

ENSECAG00000012421    synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803    Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]

I am not sure whether the note is always split across two lines, so I would like something along the lines of

awk '{if(substr($1,1,4)=="gene") gene=$1; else print gene,$1}'

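but I want it to recognize that the note may span two lines and that the note itself contains spaces. So I would like everything between the " " to be printed as column 2 (ideally with the two columns separated by \t so they cannot be confused later). I know how to get rid of gene, note and ", but I am not sure whether they are useful for identifying the parts. I am happy for it to be a series of separate commands that first put the whole note on one line and then join it to the gene, or a single command, whichever works best.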

Also, if you use awk, could you briefly explain what you are doing?

Thanks for your help!

Answer 1

If you have GNU awk or mawk (the solution relies on a regex-based input record separator, which strictly POSIX-compliant or older awk implementations do not support):

Short version:

awk -v RS=' *(gene=|note="|")' '
  { gsub("\n", ""); if ($0 == "") next; $1=$1; 
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n") }
  ' file

Annotated version:

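-v RS=' *(gene=|note="|")' - RS is the special variable that defines the input record separator. Here it is set to a regular expression that breaks the input into the records of interest, even when a record spans several lines.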

awk -v RS=' *(gene=|note="|")' '
  {    
   gsub("\n", "");     # remove all newlines from record
   if ($0 == "") next  # ignore empty records
   $1=$1;              # rebuild record to compress multiple interior spaces
    # Output:
    #  - Is it a gene record, i.e. is there only 1 field that contains a gene name?
    #    Output it with just a trailing \t, but no trailing \n, so that the next
    #    note record will print on the same line.
    #  - Otherwise: a note record: print with trailing \n, effectively
    #    appending it to the previous gene record.
   printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n")
  }
  ' file
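
For awks that lack a regex record separator, the same result can be approached line by line. The following is only a sketch under stated assumptions, not part of the answer above: the function name `merge_notes` is made up for illustration, every record is assumed to start with a `gene=` line, and all other non-empty lines are assumed to belong to that gene's note.

```shell
# merge_notes: hypothetical helper; reads the gene/note listing on stdin
# and writes "gene<TAB>note" lines. Uses only POSIX awk features.
merge_notes() {
  awk '
    # Print the row collected so far; a gene without a note is printed alone.
    function emit() {
      if (gene != "") print gene (note == "" ? "" : "\t" note)
    }

    # Strip leading and trailing blanks from every input line.
    { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }

    # A gene line starts a new row: write out the previous one first.
    /^gene=/ {
      emit()
      gene = substr($0, 6)   # the text after "gene="
      note = ""
      next
    }

    # Any other non-empty line is part of the current note.
    $0 != "" {
      sub(/^note=/, "")      # drop the label
      gsub(/"/, "")          # drop the double quotes
      gsub(/[ \t]+/, " ")    # squeeze runs of blanks
      note = (note == "" ? $0 : note " " $0)
    }

    END { emit() }
  '
}
```

Called as `merge_notes < test`, this should print one gene per line with a tab before the note, however many lines the note spans.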

Answer 2

Possibly overcomplicated, but here is one way:

/^\s*gene=/  { gene=substr($1, 6) }
/^\s*note=/  { note=substr($0, 28) }
/"$/         { if (substr($1,1,4)=="note")
                 print gene, substr($0, 28, length($0)-28);
               else
                 print gene, note, substr($0, 22, length($0)-22) }

Note that this handles both single-line and two-line notes.
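
As a hedged usage sketch, the script can be wrapped in a shell function (the name `merge_fixed_columns` is made up for illustration). Two assumptions are worth stating: `\s` is swapped for `[ \t]` here so that awks without that extension accept the patterns, and the offsets 28 and 22 only line up when every line is indented by exactly 21 spaces, as in the sample.

```shell
# merge_fixed_columns: hypothetical wrapper around the script above.
# Assumes exactly 21 leading spaces on every input line.
merge_fixed_columns() {
  awk '
    # Remember the gene ID: $1 is "gene=...", so the ID starts at character 6.
    /^[ \t]*gene=/ { gene = substr($1, 6) }

    # Remember the first note line: 21 spaces + note=" puts the text at column 28.
    /^[ \t]*note=/ { note = substr($0, 28) }

    # A line ending in a double quote closes the note.
    /"$/ {
      if (substr($1, 1, 4) == "note")
        # Single-line note: print it without the closing quote.
        print gene, substr($0, 28, length($0) - 28)
      else
        # Second line of a note: prepend the part saved earlier.
        print gene, note, substr($0, 22, length($0) - 22)
    }
  '
}
```

It would be run as `merge_fixed_columns < test`. The fields are joined with single spaces; input with a different indentation would need other offsets.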

Answer 3

Using awk:

awk 'BEGIN{FS="\n";RS="gene="}{gsub(/(note=|\")/,"");print $1,$2,$3}' file|awk '$1=$1'

ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229

Answer 4

 sed -n 'N;/"$/!N;s/\n//g;p' input | \
   sed 's/.*gene=//;s/[ \t]*note="\([^"]*\)"/\t\1 /;s/  */ /g'

Gives:

ENSECAG00000012421  synaptonemal complex central element protein 1 [Source:HGN...
ENSECAG00000017803  Uncharacterized protein [Source:Uni...
ENSECAG00000019088  cytochrome P450 2E1 [Source:Ref...

Answer 5

$ awk '{$1=$1; gsub(/"/,""); sub(/^note=/,""); pfx=(sub(/^gene=/,"")?(NR>1?ORS:""):OFS); printf "%s%s",pfx,$0} END{print ""}' file
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229
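
Since the question asks for an explanation and a tab between the two columns, here is a commented expansion of the one-liner above. It is a sketch rather than the answerer's own code: the function name `merge_notes_tab` and the `after_gene` flag are additions for illustration, and the flag is what turns the first separator after each gene into a tab.

```shell
# merge_notes_tab: hypothetical helper; reads the listing on stdin.
merge_notes_tab() {
  awk '
    {
      $1 = $1              # rebuild the record: trims and squeezes blanks
      gsub(/"/, "")        # drop the double quotes
      sub(/^note=/, "")    # drop the note= label

      if (sub(/^gene=/, "")) {
        # Gene line: start a new output line, except before the very first one.
        pfx = (NR > 1 ? ORS : "")
        after_gene = 1
      } else {
        # Note line: tab right after the gene ID, a single space otherwise.
        pfx = (after_gene ? "\t" : OFS)
        after_gene = 0
      }

      printf "%s%s", pfx, $0
    }
    END { print "" }       # terminate the last output line
  '
}
```

Used as `merge_notes_tab < test`, it streams the output without buffering whole records, so a trailing gene with no note simply ends up alone on the last line.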

How to merge lines that start with a specific substring?
