Question

I have a tab-delimited text file that looks like this:

Gene1 ID:454,ID:575,ID:44449
Gene2 ID:4344,ID:5626,ID:4
Gene3 ID:244

and I would like to turn this into long format, e.g.:

Gene1 ID:454
Gene1 ID:575
Gene1 ID:44449
Gene2 ID:4344
Gene2 ID:5626
Gene2 ID:4
Gene3 ID:244

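I figured I could use sed to go line by line, replacing each comma with a newline followed by the first string (GeneX) and a space, so that every element ends up on its own line, but I haven't made much progress. The fact that some lines have only one match (no comma) complicates the parsing.

Is this the right approach?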

Answer 1

Perl to the rescue:

perl -ane '
           @ids = split /,/, $F[1];
           print "$F[0]\t$_\n" for @ids;
          ' < input.txt > output.txt

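-n reads the file line by line
-a splits each line on whitespace into the @F array
split creates an array from a string; here, it splits the second field ($F[1]) on commas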
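
For comparison, the same three steps (read line by line, split on whitespace, split the second field on commas) can be sketched in awk wrapped in a shell function. This is only a minimal sketch, not part of the original answer: the function name to_long is made up for illustration, and it assumes each line has exactly two whitespace-separated fields.

```shell
# to_long is a hypothetical helper name used only for this illustration.
to_long() {
    awk '{
        # Split the second whitespace-separated field on commas.
        n = split($2, ids, ",")
        # Emit one line per ID, prefixed with the gene name and a tab.
        for (i = 1; i <= n; i++) print $1 "\t" ids[i]
    }' "$@"
}

# Two lines of the sample input, written with real tabs.
printf 'Gene1\tID:454,ID:575,ID:44449\nGene3\tID:244\n' | to_long
```

A line without a comma (Gene3 here) goes through the same loop with a single element, so it needs no special case.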

Answer 2

Use awk.

awk -F , '{
    # Pull off the Gene## string plus the separator after it.
    # match() finds the first space or tab, so tab-delimited input works too.
    g = substr($1, 1, match($1, /[ \t]/))
    # Set the output field separator to a newline followed by the gene string.
    OFS = "\n" g
    # Assigning to a field forces awk to rebuild $0 with the new value of OFS.
    # ($0=$0 does not work here: it only re-splits the record into fields.)
    $1 = $1
    print
}' input.txt > output.txt
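
The $1=$1 assignment is what makes awk rebuild the record with the new OFS. A small sketch of the difference between assigning to a field and assigning to $0, using a made-up input line and hypothetical function names that are not part of the answer:

```shell
# Both function names are hypothetical and exist only for this illustration.

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
rebuild_via_field()  { awk -F , '{ OFS = "-"; $1 = $1; print }'; }

# Assigning $0 to itself only re-splits the record; its text is unchanged.
rebuild_via_record() { awk -F , '{ OFS = "-"; $0 = $0; print }'; }

printf 'a,b,c\n' | rebuild_via_field     # prints a-b-c
printf 'a,b,c\n' | rebuild_via_record    # prints a,b,c
```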

Answer 3

This might work for you (GNU sed):

sed -r 's/^((\S+\s)[^,]*),/\1\n\2/;P;D' file

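This replaces the first , after the leading token with a newline followed by the first token and the whitespace after it. It then prints and deletes the first line of the pattern space, and repeats the process until no more , are substituted.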
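
To make the cycle concrete, here is a sketch that runs the same sed script on one line of the sample input and traces the pattern space in comments. The wrapper name split_ids is made up for illustration, and GNU sed is assumed (for -r, \S, \s, and \n in the replacement).

```shell
# split_ids is a hypothetical wrapper name; the sed script is the one from the answer.
split_ids() {
    sed -r 's/^((\S+\s)[^,]*),/\1\n\2/;P;D' "$@"
}

printf 'Gene2\tID:4344,ID:5626,ID:4\n' | split_ids
# Cycle 1: s/// turns the pattern space into
#            Gene2<TAB>ID:4344 \n Gene2<TAB>ID:5626,ID:4
#          P prints up to the newline; D deletes that part and restarts
#          the script without reading a new input line.
# Cycle 2: the same happens for Gene2<TAB>ID:5626.
# Cycle 3: no comma is left, so s/// does nothing; P prints Gene2<TAB>ID:4
#          and D, finding no newline, ends the cycle and moves to the next line.
```

A line with a single ID never matches the substitution, so it is printed unchanged on its first cycle.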

