I have the following lines of data:
TRINITY_GG_428_c0_g1_i1_orf1 PF13499.1 EF_hand_5
TRINITY_GG_428_c0_g1_i1_orf1 PF00036.27 efhand
TRINITY_GG_428_c0_g1_i1_orf1 PF13405.1 EF_hand_4
TRINITY_GG_428_c0_g1_i1_orf1 PF13833.1 EF_hand_6
TRINITY_GG_428_c0_g1_i1_orf1 PF13202.1 EF_hand_3
TRINITY_GG_429_c0_g1_i1_orf1 PF00156.22 Pribosyltran
TRINITY_GG_431_c5_g1_i1_orf1 PF00475.13 IGPD
TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D
TRINITY_GG_461_c0_g1_i1_orf1 PF12876.2 Cellulase-like
What I want to do is merge the lines that share the same first field into a single line each:
TRINITY_GG_428_c0_g1_i1_orf1 PF13499.1 EF_hand_5 | PF00036.27 efhand | PF13405.1 EF_hand_4 | PF13833.1 EF_hand_6 | PF13202.1 EF_hand_3
TRINITY_GG_429_c0_g1_i1_orf1 PF00156.22 Pribosyltran
TRINITY_GG_431_c5_g1_i1_orf1 PF00475.13 IGPD
TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D | PF12876.2 Cellulase-like
The matching lines are always adjacent to each other.
How can I solve this in sed / awk / Perl / Python?
Answer 0 (score: 0)
You can do something like this with a Python regex:
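import re

out_lines = []
with open('file.txt', 'r') as f:
    key = None       # first field of the group being accumulated
    key_lines = []   # remainders of the lines in that group
    for line in f:
        # Split the line into the first field and everything after it.
        m = re.match(r'^(\S+)\s(.+)$', line)
        if m is None:
            continue  # skip blank or malformed lines
        k, v = m.group(1), m.group(2)
        if k != key:
            # A new key starts: flush the previous group, if any.
            if key:
                out_lines.append('{0} {1}'.format(key, ' | '.join(key_lines)))
            key = k
            key_lines = [v]
        else:
            key_lines.append(v)
    # Flush the last group once the input is exhausted.
    if key:
        out_lines.append('{0} {1}'.format(key, ' | '.join(key_lines)))

with open('out.txt', 'w') as f:
    f.write('\n'.join(out_lines))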
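Because the matching lines are always adjacent, the same grouping can also be sketched with itertools.groupby from the standard library, without a regex. This is only a minimal sketch: the function name merge_adjacent and the two sample lines are illustrative assumptions, and it assumes every non-blank line has at least two whitespace-separated fields.

```python
from itertools import groupby


def merge_adjacent(lines):
    """Yield one merged line per run of adjacent lines sharing a first field.

    Assumption: every non-blank line has a key followed by at least one field.
    """
    # Split each non-blank line into [key, rest] at the first run of whitespace.
    pairs = (line.split(None, 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        # Join the remainders of all lines in this run with ' | '.
        rest = ' | '.join(pair[1].strip() for pair in group)
        yield '{0} {1}'.format(key, rest)


# Hypothetical sample taken from the question's data.
sample = [
    'TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D\n',
    'TRINITY_GG_461_c0_g1_i1_orf1 PF12876.2 Cellulase-like\n',
]
print('\n'.join(merge_adjacent(sample)))
```

Since merge_adjacent is a generator, an open file object can be passed in place of the list and the input is processed one line at a time.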
Answer 1 (score: 0)
Using GNU sed (with the -r flag for extended regular expressions), the script, broken down:
:label                                # Label to branch to
N                                     # Append next line to pattern space
s/^([^ ]*)( .*)\n\1(.*)$/\1\2 |\3/    # Substitution
t label                               # Branch to label if the substitution took place
P                                     # Strings weren't identical: print up to newline
D                                     # Delete up to newline, start new cycle (second line becomes first line)
The main part is the substitution: it checks whether two lines start with the same string (up to the first space). If they do, it joins the lines, removing that string from the second line and replacing the newline with a space and a pipe.
To get this to work with BSD sed, we have to split the command around the label and use the -E flag instead of -r:
sed -E -e ':a' -e 'N;s/^([^ ]*)( .*)\n\1(.*)$/\1\2 |\3/;ta' -e 'P;D' infile
For good measure, a closer look at the substitution:
s/          # Start substitution
^           # Anchor at start of pattern space
([^ ]*)     # Match and capture non-space characters (group #1)
( .*)       # Capture up to end of line (group #2)
\n          # Match newline
\1          # Start of second line: match first capture group
(.*)        # Capture rest of second line (group #3)
$           # Anchor at end of pattern space
/           # Delimiter for substitution
\1\2 |\3    # Substitute: capture groups 1 and 2, space, pipe, capture group 3
/           # End of substitution
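The substitution can also be tried outside sed. The sketch below applies the same regular expression with Python's re module and loosely mimics the N / s / t / P / D loop. The names PAIR and merge_like_sed are hypothetical, and this is an approximation of sed's pattern-space behaviour, not a faithful emulator.

```python
import re

# Same pattern as the sed substitution: group 1 is the key, group 2 the rest
# of the first line, group 3 the rest of the second line after the same key.
PAIR = re.compile(r'^([^ ]*)( .*)\n\1(.*)$')


def merge_like_sed(lines):
    """Merge adjacent lines by repeatedly applying the substitution."""
    out = []
    space = None  # plays the role of sed's pattern space
    for line in lines:
        line = line.rstrip('\n')
        if space is None:
            space = line
            continue
        # "N": append the next line; "s///": try to join the two lines.
        joined, count = PAIR.subn(r'\1\2 |\3', space + '\n' + line)
        if count:
            space = joined        # "t": substitution happened, keep looping
        else:
            out.append(space)     # "P": print the first line
            space = line          # "D": the second line becomes the first
    if space is not None:
        out.append(space)
    return out


print(merge_like_sed(['k PF1 a', 'k PF2 b', 'j PF3 c']))
```

Note that the back-reference only checks that the second line begins with the first line's key, so a key that is a prefix of the following key would also be joined; the question's data does not contain such a case.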
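Answer 2 (score: 0)
This is a very common programming pattern. You need a Perl hash to accumulate all the data belonging to each distinct initial field (the key). Then it is just a matter of printing the hash contents in the desired order and format.
This program demonstrates. I have assumed that you want the keys printed in lexical order. If you need anything different, such as the order in which they first appear in the source data, then please say so; only a small change is needed.
The program expects the path to the input file as a parameter on the command line and sends its output to STDOUT, which can be redirected in the usual way.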
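use strict;
use warnings 'all';

my %data;    # key => array of the remainders of its lines

while ( <> ) {
    chomp;
    # Split into the first field and the rest of the line.
    my ($key, $val) = split ' ', $_, 2;
    push @{ $data{$key} }, $val;
}

# Print each key once, followed by its values joined with ' | '.
print $_, ' ', join(' | ', @{ $data{$_} }), "\n" for sort keys %data;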
Output:
TRINITY_GG_428_c0_g1_i1_orf1 PF13499.1 EF_hand_5 | PF00036.27 efhand | PF13405.1 EF_hand_4 | PF13833.1 EF_hand_6 | PF13202.1 EF_hand_3
TRINITY_GG_429_c0_g1_i1_orf1 PF00156.22 Pribosyltran
TRINITY_GG_431_c5_g1_i1_orf1 PF00475.13 IGPD
TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D | PF12876.2 Cellulase-like
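For comparison, the hash approach above can be sketched in Python with a dictionary of lists. The function name merge_by_key and its sort_keys parameter are assumptions made for illustration: sort_keys=True prints keys in lexical order, as the Perl program does, while sort_keys=False keeps the order in which keys first appear, relying on dictionaries preserving insertion order (Python 3.7+).

```python
from collections import defaultdict


def merge_by_key(lines, sort_keys=True):
    """Collect the remainder of every line under its first field."""
    data = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # assumption: skip blank or key-only lines
        key, val = parts
        data[key].append(val.strip())
    # Lexical order, or first-appearance order when sort_keys is False.
    keys = sorted(data) if sort_keys else list(data)
    return ['{0} {1}'.format(k, ' | '.join(data[k])) for k in keys]


# Keys need not be adjacent for this approach to work.
for merged in merge_by_key(['b x 1', 'a y 2', 'b z 3']):
    print(merged)
```

Unlike the adjacent-line approaches, this holds every group in memory until the whole input has been read.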
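Answer 3 (score: 0)
Just build up a record that concatenates all the lines for as long as the first field of the current line is the same as on the previous line, and print it whenever the value of the first field changes: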
$ awk '
    $1==prev { rec = rec " | " $2 " " $3 }
    $1!=prev { if (NR>1) print rec; rec=$0 }
    { prev=$1 }
    END { print rec }
' file
TRINITY_GG_428_c0_g1_i1_orf1 PF13499.1 EF_hand_5 | PF00036.27 efhand | PF13405.1 EF_hand_4 | PF13833.1 EF_hand_6 | PF13202.1 EF_hand_3
TRINITY_GG_429_c0_g1_i1_orf1 PF00156.22 Pribosyltran
TRINITY_GG_431_c5_g1_i1_orf1 PF00475.13 IGPD
TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D | PF12876.2 Cellulase-like
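Alternatively, if the lines for a given key are not contiguous in your input, you don't care whether the output order matches the input order, and the input file is small enough to hold entirely in memory, then you can use the hashing approach suggested in a different answer: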
$ awk '{a[$1]=($1 in a ? a[$1]" | "$2" "$3 : $0)} END{for (k in a) print a[k]}' file
TRINITY_GG_429_c0_g1_i1_orf1 PF00156.22 Pribosyltran
TRINITY_GG_461_c0_g1_i1_orf1 PF01208.12 URO-D | PF12876.2 Cellulase-like
TRINITY_GG_431_c5_g1_i1_orf1 PF00475.13 IGPD
TRINITY_GG_428_c0_g1_i1_orf1 PF13499.1 EF_hand_5 | PF00036.27 efhand | PF13405.1 EF_hand_4 | PF13833.1 EF_hand_6 | PF13202.1 EF_hand_3