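I have a very large tab-delimited text file. Many lines in the file share the same value in one of its columns. I want to merge those lines into a single line. For example: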
a foo
a bar
a foo2
b bar
c bar2
After running the script, it should become:
a foo;bar;foo2
b bar
c bar2
How can I do this in a shell script or in Python?
Thanks.
Answer 0: (score: 3)
With awk you could try this:
{ a[$1] = a[$1] ";" $2 }
END { for (item in a ) print item, a[item] }
So, if you save this awk script in a file named awkf.awk and the input file is ifile.txt, run the script like this:
awk -f awkf.awk ifile.txt | sed 's/ ;/ /'
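The sed command removes the leading ; from each line. Note that awk's for (item in a) loop visits keys in an unspecified order, so the output lines are not guaranteed to follow the input order.
Hope this helps.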
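For comparison, the same grouping can be sketched in Python so that no leading ; is ever produced and keys keep their first-seen order. This is only a minimal sketch, not part of the answer above: the function name merge_lines is hypothetical, and it assumes Python 3.7+ (where dict preserves insertion order) and that every non-blank line holds exactly one key and one value separated by a tab.

```python
def merge_lines(lines):
    """Group tab-separated (key, value) lines by key, keeping first-seen key order."""
    groups = {}  # key -> list of values; dicts keep insertion order in Python 3.7+
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        groups.setdefault(key, []).append(value)
    # Joining the collected values avoids the leading ';' that plain concatenation produces.
    return ["%s\t%s" % (key, ";".join(values)) for key, values in groups.items()]


sample = ["a\tfoo\n", "a\tbar\n", "a\tfoo2\n", "b\tbar\n", "c\tbar2\n"]
for merged in merge_lines(sample):
    print(merged)
```

An open file object can be passed in place of sample, since the function only iterates over lines. All groups are held in memory, which may matter for a very large file.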
Answer 1: (score: 2)
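from collections import defaultdict

# Group the second column by the value in the first column.
items = defaultdict(list)
with open('sourcefile') as source:
    for line in source:
        # Strip the trailing newline so it does not end up inside the joined values.
        key, val = line.rstrip('\n').split('\t', 1)
        items[key].append(val)

with open('result', 'w') as result:
    for k in sorted(items):
        result.write('%s\t%s\n' % (k, ';'.join(items[k])))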
Untested.
Answer 2: (score: 1)
Tested with Python 2.7:
import csv
data = {}
reader = csv.DictReader(open('infile','r'),fieldnames=['key','value'],delimiter='\t')
for row in reader:
    if row['key'] in data:
        data[row['key']].append(row['value'])
    else:
        data[row['key']] = [row['value']]
writer = open('outfile','w')
for key in data:
    writer.write(key + '\t' + ';'.join(data[key]) + '\n')
writer.close()
Answer 3: (score: 0)
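def compress(infilepath, outfilepath):
    # Streams the file; assumes lines with the same key are adjacent (e.g. sorted input).
    with open(infilepath, 'r') as infile, open(outfilepath, 'w') as outfile:
        prev_index = None
        for line in infile:
            index, val = line.rstrip('\n').split('\t', 1)
            if index == prev_index:
                outfile.write(";%s" % val)
            else:
                # Terminate the previous output line before starting a new key.
                if prev_index is not None:
                    outfile.write("\n")
                outfile.write("%s\t%s" % (index, val))
                prev_index = index
        if prev_index is not None:
            outfile.write("\n")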
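Untested, but it should work as long as lines that share a key are adjacent in the input. Leave a comment if you have any concerns.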
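The adjacent-key streaming idea can also be sketched with itertools.groupby from the standard library. This is a sketch under assumptions rather than a tested drop-in: the name merge_adjacent is hypothetical, every non-blank line is assumed to contain a tab, and rows with the same key must already be next to each other (for example because the file was sorted by key beforehand).

```python
from itertools import groupby


def merge_adjacent(lines):
    """Yield merged lines, assuming rows that share a key are adjacent."""
    # Lazily split each non-blank line into [key, value].
    rows = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges consecutive rows with an equal key.
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield "%s\t%s" % (key, ";".join(row[1] for row in group))


sample = ["a\tfoo\n", "a\tbar\n", "a\tfoo2\n", "b\tbar\n", "c\tbar2\n"]
for merged in merge_adjacent(sample):
    print(merged)
```

Because only one group is held in memory at a time, this suits a very large file. If the same key reappears later in unsorted input, it starts a new output line instead of being merged.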
Answer 4: (score: 0)
A Perl approach:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
open my $fh, '<', 'path/to/file' or die "unable to open file:$!";
my %res;
while(<$fh>) {
    my ($k, $v) = split;
    push @{$res{$k}}, $v;
}
print Dumper \%res;
Output:
$VAR1 = {
          'c' => [
                   'bar2'
                 ],
          'a' => [
                   'foo',
                   'bar',
                   'foo2'
                 ],
          'b' => [
                   'bar'
                 ]
        };
Answer 5: (score: 0)
#! /usr/bin/env perl
use strict;
use warnings;
# for demo only
*ARGV = *DATA;
my %record;
my @order;
while (<>) {
    chomp;
    my($key,$combine) = split;
    push @order, $key unless exists $record{$key};
    push @{ $record{$key} }, $combine;
}
print $_, "\t", join(";", @{ $record{$_} }), "\n" for @order;
__DATA__
a foo
a bar
a foo2
b bar
c bar2
Output (tabs converted to spaces, because Stack Overflow mangles the output):
a foo;bar;foo2
b bar
c bar2