Collapse rows based on the same key

Time: 2014-02-12 10:03:20

Tags: python perl bash awk

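I want to collapse rows that share the same value in the first column. The second-column values of each group should then be added to the new collapsed table, separated by a comma followed by a space. In addition, identical second-column values should be collapsed too: if "non-virulent" would appear twice for a key in the output file, it should be shown only once.

I am very new to this, so please explain how to run it. I hope someone can help me!

Input (tab-separated):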

HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

Desired output (tab-separated):

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

8 answers:

Answer 0 (score: 2)

Perl from the command line:

perl -lane'
  ($n, $p) =@F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",\t";
    print "$_\t@{$h{$_}}" for @r;
  }
' file

Output

HS372_01446     non-virulent,   lung
HS372_00498     non-virulent,   lung
HS372_00954     jointlungCNS,   non-virulent,   moderadamentevirulenta(nose),  lung

Answer 1 (score: 2)

Another Perl solution:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/uniq/;

my %hash;
while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}

while ( my ( $key, $values ) = each %hash )
{
    # Parenthesize join so "\n" is printed after the list, not joined into it
    print "$key\t", join(', ', uniq @$values), "\n";
}

__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

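Answer 2 (score: 2)

This does what you ask. In addition, the IDs and descriptions are kept in the same order as they appear in the file, in case that matters: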

use strict;
use warnings;

open my $fh, '<', 'diseases.txt' or die "Cannot open diseases.txt: $!";

my %diseases;
my @ids;

while (<$fh>) {
  my ($id, $desc) = split;
  if (not $diseases{$id}) {
    $diseases{$id}{list} = [$desc];
    $diseases{$id}{seen}{$desc} = 1;
    push @ids, $id;
  }
  elsif (not $diseases{$id}{seen}{$desc}) {
    push @{ $diseases{$id}{list} }, $desc;
    $diseases{$id}{seen}{$desc} = 1;
  }
}

for my $id (@ids) {
  printf "%s %s\n", $id, join ', ', @{ $diseases{$id}{list} };
}

Output

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

Answer 3 (score: 1)

from collections import defaultdict

a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".split("\n")

stuff = defaultdict(set)

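for line in a:
    # split() with no argument handles both tabs and spaces
    uid, symp = line.split()
    stuff[uid].add(symp)

# Python 3: items() and print() replace iteritems() and the print statement
for uid, symps in stuff.items():
    print("%s %s" % (uid, ", ".join(symps)))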
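
Note that a set does not keep the order in which values were first seen, while the desired output in the question does. A minimal sketch of an order-preserving variant, assuming Python 3.7+ (where plain dicts keep insertion order); the function name `collapse` and the shortened sample data are only for illustration:

```python
def collapse(lines):
    """Group second-column values by first-column key, keeping first-seen order."""
    groups = {}  # key -> dict used as an insertion-ordered set of values
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], fields[1]
        groups.setdefault(key, {})[value] = None
    return ["%s\t%s" % (key, ", ".join(values)) for key, values in groups.items()]


# Shortened sample taken from the question's input
sample = [
    "HS372_01446\tnon-virulent",
    "HS372_01446\tnon-virulent",
    "HS372_01446\tlung",
    "HS372_00498\tnon-virulent",
    "HS372_00498\tlung",
]
for row in collapse(sample):
    print(row)
```

To run it on a real file, one option is to save this as a script (say `collapse.py`, a hypothetical name), pass an open file object such as `open("input.txt")` to `collapse` instead of `sample`, and start it with `python3 collapse.py`.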

Answer 4 (score: 1)

Java:

javac Collapse.java

java Collapse input.txt

import java.io.*;
import java.util.*;

public class Collapse {

    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));

        Map<String, Set<String>> output = new HashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, "\t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }

        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "\t" + sb);
        }
    }
}

Answer 5 (score: 1)

The standard UNIX tool for parsing text files is awk:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file
HS372_00498     non-virulent, lung
HS372_00954     jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446     non-virulent, lung

Answer 6 (score: 0)

In Perl:

use warnings;
use strict; 

open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!";

my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    $hash{$split[0]}{$split[1]} = 1;
}

for my $key (keys %hash){
    print "$key\t";
        # Dereference explicitly: keys on a hash reference no longer works in newer Perls
        for my $info (keys %{ $hash{$key} }){
            print "$info\t";
        }
    print "\n";
} 

Which prints:

HS372_01446 non-virulent    lung    
HS372_00954 non-virulent    moderadamentevirulenta(nose)    jointlungCNS    lung    
HS372_00498 non-virulent    lung

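Answer 7 (score: 0)

If your data comes from a MySQL database (or you can import it into one), you can use the group_concat function.

See this answer: Can I concatenate multiple MySQL rows into one field?

It currently has 431 upvotes, so yours is a very common problem, and the answer shows a very elegant solution.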
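
To illustrate the idea without a MySQL server, here is a rough sketch using Python's built-in `sqlite3` module, whose `group_concat` aggregate is similar in spirit. The table name `traits`, its columns, and the shortened sample rows are assumptions made for this example, and this is SQLite, not MySQL:

```python
import sqlite3

# Shortened sample taken from the question's input
rows = [
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "lung"),
    ("HS372_00498", "non-virulent"),
    ("HS372_00498", "lung"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traits (id TEXT, trait TEXT)")  # hypothetical schema
conn.executemany("INSERT INTO traits VALUES (?, ?)", rows)

# De-duplicate (id, trait) pairs first, then join each group's traits.
query = """
    SELECT id, group_concat(trait, ', ')
    FROM (SELECT DISTINCT id, trait FROM traits)
    GROUP BY id
"""
collapsed = dict(conn.execute(query).fetchall())
for key, traits in collapsed.items():
    print(key + "\t" + traits)
conn.close()
```

SQLite does not promise any particular order for the concatenated values or for the groups, so the result may not match the input order. In MySQL itself the de-duplication can be written directly as `GROUP_CONCAT(DISTINCT trait SEPARATOR ', ')`.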