
Question

Given the input file below, is there a command or technique in Linux that converts it into the desired output file shown further down?

Input file:

Column_1     Column_2  
scaffold_A   SNP_marker1
scaffold_A   SNP_marker2
scaffold_A   SNP_marker3
scaffold_A   SNP_marker4
scaffold_B   SNP_marker5
scaffold_B   SNP_marker6
scaffold_B   SNP_marker7
scaffold_C   SNP_marker8
scaffold_A   SNP_marker9
scaffold_A   SNP_marker10

Desired output file:

Column_1     Column_2  
scaffold_A   SNP_marker1;SNP_marker2;SNP_marker3;SNP_marker4
scaffold_B   SNP_marker5;SNP_marker6;SNP_marker7
scaffold_C   SNP_marker8
scaffold_A   SNP_marker9;SNP_marker10

I was thinking of using grep, uniq, etc., but I still can't figure out how to get this done.

Answer 1

A Perl solution:

perl -lane 'sub output {
                print "$last\t", join ";", @buff;
            }
            $last //= $F[0];
            if ($F[0] ne $last) {
               output();
               undef @buff;
               $last = $F[0];
            }
            push @buff, $F[1];
            }{ output();'

Answer 2

A Python solution (assuming the file names are passed on the command line):

from __future__ import print_function  # not needed with Python 3
import sys

# the input and output file names are taken from the command line
with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile:
    outfile.write(infile.readline()) # transfer the header
    col_one, col_two = infile.readline().split()
    col_two = [col_two] # make it a list
    for line in infile:
        data = line.split()
        if col_one != data[0]:
            print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)
            col_one = data[0]
            col_two = [data[1]]
        else:
            col_two.append(data[1])
    print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)

Answer 3

An awk solution inside a bash script:

#!/bin/bash
awk '
BEGIN { str = "" }
{
    if ( str != $1 ) {
        if ( NR != 1 ) { printf("\n") }
        str = $1
        printf("%s\t%s", $1, $2)
    } else if ( str == $1 ) {
        printf(";%s", $2)
    }
}
END { printf("\n") }' your_file.txt

Answer 4

You can also try the following solution in bash:

cat input.txt | while read L; do y=`echo $L | cut -f1 -d' '`; { test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`"; } || { x="$y";echo -en "\n$L"; }; done

Or, in a more human-readable form for review:

cat input.txt | while read L;
do
  y=`echo $L | cut -f1 -d' '`;
  {
    test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`";
  } || 
  {
    x="$y";echo -en "\n$L"; 
  };
done

Note that the formatting of the output relies on bash's built-in echo command, specifically its -n and -e options.

Answer 5

If you don't mind using Python, it has itertools.groupby, which exists for exactly this purpose:

# file: combine.py
import itertools

with open('data.txt') as f:
    data = [row.split() for row in f]

for column1, rows_group in itertools.groupby(data, key=lambda row: row[0]):
    # format() keeps the output identical under Python 2 and Python 3
    print('{} {}'.format(column1, ';'.join(column2 for _, column2 in rows_group)))

Save this script as combine.py. Assuming your input file is data.txt, run it to get the desired output:

python combine.py

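Discussion

The result of the with open(...) block is data, a list of rows, where each row is itself a list of columns.
The itertools.groupby function takes an iterable, in this case a list. You tell it how to group rows together by means of a key (here, column1). It only groups consecutive rows with equal keys, which is what the desired output calls for.
rows_group is an iterator over the rows that share the same column1.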
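
To make the grouping behavior concrete, here is a minimal sketch that wraps the same groupby call in a function and runs it on a few rows taken from the question. The names combine_runs and sample are made up for this illustration and are not part of the script above. It shows that the second run of scaffold_A stays separate from the first, because groupby starts a new group whenever the key changes.

```python
import itertools


def combine_runs(lines):
    """Merge consecutive rows that share the same first column.

    combine_runs is a hypothetical helper for illustration only.
    """
    # Split each non-blank line into its two columns.
    rows = [line.split() for line in lines if line.strip()]
    combined = []
    for column1, rows_group in itertools.groupby(rows, key=lambda row: row[0]):
        # rows_group yields only the rows of the current consecutive run.
        combined.append((column1, ';'.join(row[1] for row in rows_group)))
    return combined


# A shortened version of the input from the question (assumed sample data).
sample = [
    "scaffold_A SNP_marker1",
    "scaffold_A SNP_marker2",
    "scaffold_B SNP_marker5",
    "scaffold_A SNP_marker9",
]

for column1, markers in combine_runs(sample):
    print(column1, markers)
# scaffold_A SNP_marker1;SNP_marker2
# scaffold_B SNP_marker5
# scaffold_A SNP_marker9
```

If you wanted all scaffold_A rows merged into a single line regardless of position, you would have to sort the rows by key first, but that is not what the question asks for.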

Combining rows in Linux
