I have a file that looks like this:
$ head test
gene=ENSECAG00000012421
note="synaptonemal complex central element protein 1
[Source:HGNC Symbol;Acc:28852]"
gene=ENSECAG00000017803
note="Uncharacterized protein
[Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
gene=ENSECAG00000019088
note="cytochrome P450 2E1 [Source:RefSeq
peptide;Acc:NP_001104773]"
gene=ENSECAG00000004229
I would like it to look like this:
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
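I'm not sure whether the note is always split across two lines, so I want something along the lines of
awk '{if(substr($1,1,4)=="gene") gene=$1; else print gene,$1}'
but I need it to recognize that the note may span two lines and that it contains spaces. So I'd like it to print everything between the quotes (" ") as column 2, ideally with the two columns separated by \t so that they don't get mixed up later. I know how to strip gene, note and ", but I'm not sure whether they would help with the recognition. I'm happy for this to be a series of separate commands (first put the whole note on one line, then join it with the gene) or a single command, whichever works best.
Also, if you use awk, could you briefly explain what you are doing?
Thanks for your help!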
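Answer 0 (score: 2)
If you have GNU awk or mawk (the solution relies on a regex-based input record separator, which strictly POSIX-compliant or older awk implementations do not support):
Short version: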
awk -v RS=' *(gene=|note="|")' '
{ gsub("\n", ""); if ($0 == "") next; $1=$1;
printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n") }
' file
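Annotated version:
-v RS=' *(gene=|note="|")' - RS is the special variable that defines the input record separator. Here it is set to a regular expression that splits the input into the records of interest, regardless of line breaks.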
awk -v RS=' *(gene=|note="|")' '
{
gsub("\n", ""); # remove all newlines from record
if ($0 == "") next # ignore empty records
$1=$1; # rebuild record to compress multiple interior spaces
# Output:
# - Is it a gene record, i.e. is there only 1 field that contains a gene name?
# Output it with just a trailing \t, but no trailing \n, so that the next
# note record will print on the same line.
# - Otherwise: a note record: print with trailing \n, effectively
# appending it to the previous gene record.
printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n")
}
' file
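For awk implementations that do not accept a regular expression in RS, the same result can probably be reached by reading the file line by line and buffering the note until the next gene= line arrives. The following is only a minimal sketch under the assumption that the input has the layout shown in the question (a gene= line followed by an optional note="..." that may wrap over several lines); the function name join_notes is made up for illustration and is not part of the answer above.

```shell
# join_notes FILE: hypothetical helper, prints "gene<TAB>note" per gene.
join_notes() {
  awk '
    {
      sub(/^[ \t]+/, "")            # drop leading indentation, if any
      sub(/[ \t]+$/, "")            # drop trailing blanks
      gsub(/[ \t]+/, " ")           # squeeze interior runs of blanks
    }
    $0 == "" { next }               # skip empty lines
    /^gene=/ {
      if (gene != "") print_rec()   # flush the previous gene first
      gene = substr($0, 6)          # text after "gene="
      note = ""
      next
    }
    {
      sub(/^note=/, "")             # strip the tag on the first note line
      gsub(/"/, "")                 # strip the double quotes
      note = (note == "" ? $0 : note " " $0)   # join wrapped lines with one space
    }
    END { if (gene != "") print_rec() }

    # A gene without a note is printed alone, without a trailing tab.
    function print_rec() { print (note == "" ? gene : gene "\t" note) }
  ' "$1"
}
```

Called as join_notes file, this joins wrapped note lines with a single space, so it does not depend on how the continuation lines are indented.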
Answer 1 (score: 1)
Possibly overcomplicated, but here is one way:
/^\s*gene=/ { gene=substr($1, 6) }
/^\s*note=/ { note=substr($0, 28) }
/"$/ { if (substr($1,1,4)=="note")
print gene, substr($0, 28, length($0)-28);
else
print gene, note, substr($0, 22, length($0)-22) }
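Note that this handles both single-line and two-line notes. The offsets 28 and 22 are character positions in $0, so they only fit input whose note lines are indented by 21 characters (21 plus the 6 characters of note=" puts the text at position 28); adjust them if your layout differs. Also, \s is a GNU awk extension.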
Answer 2 (score: 0)
Using awk:
awk 'BEGIN{FS="\n";RS="gene="}{gsub(/(note=|\")/,"");print $1,$2,$3}' file|awk '$1=$1'
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229
Answer 3 (score: 0)
sed -n 'N;;/"$/!N;s/\n//g;p' input | \
sed 's/.*gene=//;s/[ \t]*note="\([^"]*\)"/\t\1 /;s/ */ /g'
gives:
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGN...
ENSECAG00000017803 Uncharacterized protein [Source:Uni...
ENSECAG00000019088 cytochrome P450 2E1 [Source:Ref...
Answer 4 (score: 0)
$ awk '{$1=$1; gsub(/"/,""); sub(/^note=/,""); pfx=(sub(/^gene=/,"")?(NR>1?ORS:""):OFS); printf "%s%s",pfx,$0} END{print ""}' file
ENSECAG00000012421 synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]
ENSECAG00000019088 cytochrome P450 2E1 [Source:RefSeq peptide;Acc:NP_001104773]
ENSECAG00000004229
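Since the question asks for an explanation, the one-liner above can be unpacked as follows. This is a commented sketch of the same logic, not the answerer's own wording; the function name flatten_notes is hypothetical. Like the one-liner, it separates the gene from the note with OFS (a single space by default), not with a tab.

```shell
# flatten_notes FILE: hypothetical wrapper around the unpacked one-liner.
flatten_notes() {
  awk '
    {
      $1 = $1                 # rebuild the line: trims it and squeezes blanks
      gsub(/"/, "")           # remove every double quote
      sub(/^note=/, "")       # remove the note= tag, if present

      # sub() returns 1 when it removed a leading gene=, 0 otherwise.
      # A gene line starts a new output line (except the very first one);
      # any other line is appended to the current output line after OFS.
      pfx = (sub(/^gene=/, "") ? (NR > 1 ? ORS : "") : OFS)

      printf "%s%s", pfx, $0  # no newline here, the next line decides
    }
    END { print "" }          # terminate the last output line
  ' "$1"
}
```

Because each line is printed as soon as it is read, this copes with notes that span any number of lines.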