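I want to collapse rows based on equality of the first column, then put the contents of the second column into the new collapsed table, separated by a comma followed by a space. In addition, if values in the second column are identical, collapse them too; that is, if "non-virulent" would appear twice in an output row, show it only once.
I'm very new to this, so please explain how to run it. I hope someone can help!
Input (tab-delimited):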
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung
Desired output (tab-delimited):
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
Answer 0 (score: 2)
Perl from the command line:
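perl -lane '
  ($n, $p) = @F;
  # remember each ID the first time it is seen, to keep input order
  $s{$n}++ or push @r, $n;
  # keep each description once per ID, in first-seen order
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ", ";   # list separator used when @{...} is interpolated
    print "$_\t@{$h{$_}}" for @r;
  }
' file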
Output
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
Answer 1 (score: 2)
Another Perl solution:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/uniq/;
my %hash;
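while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}
# note: each() returns the keys in no particular order
while ( my ( $key, $values ) = each %hash )
{
    # join() is parenthesized so that "\n" is not passed to uniq and joined
    print "$key\t", join( ', ', uniq @$values ), "\n";
}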
__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung
Answer 2 (score: 2)
This does what you asked; in addition, the IDs and descriptions appear in the same order as in the file, in case that matters:
use strict;
use warnings;
open my $fh, '<', 'diseases.txt' or die "Cannot open diseases.txt: $!";
my %diseases;
my @ids;
while (<$fh>) {
    my ($id, $desc) = split;
    if (not $diseases{$id}) {
        # first time this ID is seen: start its list and record the order
        $diseases{$id}{list} = [$desc];
        $diseases{$id}{seen}{$desc} = 1;
        push @ids, $id;
    }
    elsif (not $diseases{$id}{seen}{$desc}) {
        # known ID, new description
        push @{ $diseases{$id}{list} }, $desc;
        $diseases{$id}{seen}{$desc} = 1;
    }
}
for my $id (@ids) {
    printf "%s\t%s\n", $id, join ', ', @{ $diseases{$id}{list} };
}
Output
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
Answer 3 (score: 1)
from collections import defaultdict
a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".split("\n")
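stuff = defaultdict(set)
for line in a:
    # split() with no argument handles both tabs and spaces
    uid, symp = line.split()
    stuff[uid].add(symp)
# sets are unordered, so the descriptions may print in any order
for uid, symps in stuff.items():
    print("%s\t%s" % (uid, ", ".join(symps)))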
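The set-based version above does not keep the descriptions in the order they first appear. If that order matters, here is a minimal sketch of an order-preserving variant. It assumes Python 3.7 or later, where plain dicts keep insertion order; the function name `collapse` and the sample list are made up for illustration.

```python
def collapse(lines):
    """Group second-column values by first-column key, dropping duplicates.

    Keys and values keep the order in which they first appear.
    """
    groups = {}  # key -> dict used as an ordered set of values
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        key, value = fields
        groups.setdefault(key, {})[value] = None
    # joining a dict iterates over its keys, i.e. the unique values
    return ["%s\t%s" % (key, ", ".join(values)) for key, values in groups.items()]


# Illustrative input only; in practice pass an open file object instead.
sample = [
    "HS372_01446\tnon-virulent",
    "HS372_01446\tnon-virulent",
    "HS372_01446\tlung",
]
for row in collapse(sample):
    print(row)
```

To run it on a real file, save the function in a script and call `collapse(open("input.txt"))`, where `input.txt` stands for your tab-delimited file.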
Answer 4 (score: 1)
Java:
javac Collapse.java
java Collapse input.txt
import java.io.*;
import java.util.*;
public class Collapse {
    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));
        // LinkedHashMap keeps the keys in the order they first appear in the input
        Map<String, Set<String>> output = new LinkedHashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, "\t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                // LinkedHashSet drops duplicates while keeping insertion order
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }
        br.close();
        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "\t" + sb);
        }
    }
}
Answer 5 (score: 1)
The standard UNIX tool for parsing text files is awk:
$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446 non-virulent, lung
Answer 6 (score: 0)
In Perl:
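use warnings;
use strict;
open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!";
my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    # hash keys make the descriptions unique for each ID
    $hash{$split[0]}{$split[1]} = 1;
}
for my $key (keys %hash){
    print "$key\t";
    # dereference explicitly: keys on a hash reference is not supported by current perls
    for my $info (keys %{ $hash{$key} }){
        print "$info\t";
    }
    print "\n";
}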
Which prints:
HS372_01446 non-virulent lung
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung
HS372_00498 non-virulent lung
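Answer 7 (score: 0)
If your data comes from a MySQL database (or you can import it into one), you can use the GROUP_CONCAT function.
See this answer: Can I concatenate multiple MySQL rows into one field?
It currently has 431 upvotes, so your question is a very common one, and the answer shows a very elegant solution.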
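To make the idea concrete without a MySQL server, here is a rough sketch that uses SQLite (through Python's built-in `sqlite3` module) as a stand-in, since it also provides a `group_concat` aggregate. The table name `strains` and its column names are assumptions made up for this example. In MySQL itself the equivalent would be `GROUP_CONCAT(DISTINCT description SEPARATOR ', ')`.

```python
import sqlite3

# A few rows from the question, used as sample data
rows = [
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "lung"),
    ("HS372_00498", "non-virulent"),
    ("HS372_00498", "lung"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names
conn.execute("CREATE TABLE strains (id TEXT, description TEXT)")
conn.executemany("INSERT INTO strains VALUES (?, ?)", rows)

# Duplicates are removed in a subquery first, because SQLite only accepts
# DISTINCT inside group_concat when no separator argument is given.
query = """
    SELECT id, group_concat(description, ', ')
    FROM (SELECT DISTINCT id, description FROM strains)
    GROUP BY id
"""
result = dict(conn.execute(query).fetchall())
conn.close()

for key, joined in result.items():
    print("%s\t%s" % (key, joined))
```

The order of the descriptions inside each concatenated string is not guaranteed here, so this fits best when the order does not matter.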