Question

I have a file containing a large number of records of this type:

|1|a|b|c|||||||
|1||||aa|bb|cc||||
|1|||||||aaa|bbb|ccc|
|2|fd|ef|gf|||||||
|1||||zz|yy|dd||||

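I need to merge the records that have the same value in the first field, so that ideally the result looks like this (assuming the last record read is the most recent one):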

|1|a|b|c|zz|yy|dd|aaa|bbb|ccc|
|2|fd|ef|gf|||||||

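I have been thinking about the best way to do this. I considered putting everything into a database table with the first field as the primary key, and I have also been looking at Perl hashes... but nothing sounds ideal. Thoughts? Something in Perl or Python would be great, but I can run pretty much anything on Unix.

Thanks!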

Answer 1

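my %merged_rows;
while (<>) {
   chomp;
   # The -1 limit keeps empty trailing fields
   my @fields = split(/\|/, $_, -1);
   # $fields[0] is the empty string before the leading '|'
   my $id = $fields[1];
   my $merged_row = $merged_rows{$id} ||= [];

   # Copy a field if it is non-empty, or if the merged row has no such column yet
   $merged_row->[$_] = $fields[$_]
      for grep { length($fields[$_]) || $_ > $#$merged_row } 0..$#fields;
}

# Print the merged rows in numeric order of their ids
for my $id ( sort { $a <=> $b } keys(%merged_rows) ) {
   print(join('|', @{ $merged_rows{$id} }), "\n");
}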

If the keys are all small numbers, you could get a small speed boost by using an array instead of a hash to hold the merged rows.
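As a rough illustration of that idea, here is a minimal Python sketch (not part of the original answer) that keeps the merged rows in a list indexed by the numeric id rather than in a dictionary. The function name `merge_by_small_id` is hypothetical, and the sketch assumes every id is a small non-negative integer.

```python
def merge_by_small_id(lines):
    """Merge pipe-delimited records, using the integer id as a list index."""
    rows = []  # rows[rec_id] holds the merged fields for that id, or None
    for line in lines:
        fields = line.rstrip('\n').split('|')
        # Assumption: ids are small non-negative integers
        rec_id = int(fields[1])
        # Grow the list so that rows[rec_id] exists
        if rec_id >= len(rows):
            rows.extend([None] * (rec_id + 1 - len(rows)))
        if rows[rec_id] is None:
            rows[rec_id] = fields
            continue
        merged = rows[rec_id]
        for i, value in enumerate(fields):
            if i >= len(merged):
                # Column not seen before for this id: append it as-is
                merged.append(value)
            elif value:
                # Non-empty value from a later record overwrites the older one
                merged[i] = value
    return ['|'.join(r) for r in rows if r is not None]
```

Calling `merge_by_small_id(open('data.txt'))` would return the merged lines in id order. Whether this is actually faster than a dictionary depends on how dense the ids are, so it is worth measuring before relying on it.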

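Without the limit, split removes empty trailing fields, so |1|a|b|c||||||| would be treated the same as |1|a|b|c.
$z = $x ||= $y; is the same as $x ||= $y; $z = $x;
$x ||= $y; is basically the same as $x = $x || $y; it assigns the RHS to the LHS if the LHS is false. In this context, it does $merged_rows{$id} = []; if this is the first time we encounter $id.
[] creates an empty array and returns a reference to it.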
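
For readers following along in Python rather than Perl, the sketch below shows rough counterparts of the idioms in these notes. It is an illustration only, and the variable names are made up for the example. Python's `str.split` keeps empty trailing fields without any extra argument, and `dict.setdefault` plays roughly the role of `||= []` here, with the difference that it tests whether the key is present rather than whether the value is false.

```python
line = "|1|a|b|c|||||||"

# Unlike Perl's split without a limit, str.split keeps empty trailing fields.
fields = line.split('|')
print(len(fields))  # 12

merged_rows = {}
# setdefault stores [] under the key only if the key is missing, then
# returns the stored list (roughly what `$merged_rows{$id} ||= []` does).
first = merged_rows.setdefault(fields[1], [])
second = merged_rows.setdefault(fields[1], [])
print(first is second)  # True
```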

Answer 2

Here is a script in Python.

Going through the lines in order, it overwrites a section only when the new value is not empty.

def merge_lines():
    with open('data.txt', 'r') as file, open('output.txt', 'w') as file_out:
        # Maps the record id to its merged list of sections
        output_dict = {}
        for line in file:
            # Strip the newline so it is not mistaken for a section value
            split_line = line.rstrip('\n').split('|')
            # Remove first empty string
            del split_line[0]
            key = split_line[0]
            # If we haven't seen this record before then add it to dictionary
            if key not in output_dict:
                output_dict[key] = split_line
            else:
                # If we have seen it then update the sections providing
                # they are not the empty string ('')
                for index, val in enumerate(split_line):
                    if val != '':
                        output_dict[key][index] = val

        # Join sections back together and write lines to file
        for line_values in output_dict.values():
            file_out.write('|' + '|'.join(line_values) + '\n')


if __name__ == "__main__":
    merge_lines()

Answer 3

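def update_col(l1, l2):
    # Copy every non-empty field of l2 over the same position in l1
    for i, v in enumerate(l2):
        if not v:
            continue
        l1[i] = v

out = []
with open('rec.txt') as f:
    for l in f:
        l = l.strip().split('|')
        for r in out:
            if r[1] == l[1]:
                update_col(r, l)
                break
        else:
            # No existing record has this id, so start a new one
            out.append(l)

for l in out:
    print('|'.join(l))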

Output

|1|a|b|c|zz|yy|dd|aaa|bbb|ccc|
|2|fd|ef|gf|||||||

Merging records into one
