Question

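I am facing a file trimming problem. I want to trim rows in a tab-delimited file.

The rule is: for rows that have the same values in the first two columns, keep only the row with the largest value in the third column. The number of such redundant rows per pair of column values may vary. If there is a tie for the largest value in the third column, keep the first one (after ordering the file).

(1) My file looks like this (tab-delimited, several million rows):

(2) The output I want: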

1 100 25 T
1 101 30 A
1 102 40 T

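This question comes from my real research, not homework. I am hoping for your help because I have limited programming skills. I would prefer a computationally efficient approach, because my data file has a lot of rows. Your help would be very valuable to me.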

Answer 1

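Here is a solution that relies on the input file already being sorted properly. It scans line by line for lines with the same start (i.e. the first two columns are identical), checks the third column value, and keeps the line with the highest value, or, on a tie, the line that appeared first in the file. When a new start is found, it prints the old line and begins checking again.

At the end of the input file, the max line still held in memory is printed.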

use warnings;
use strict;

my ($max_line, $start, $max) = parse_line(scalar <DATA>);
while (<DATA>) {
    my ($line, $nl_start, $nl_max) = parse_line($_);
    if ($nl_start eq $start) {
        if ($nl_max > $max) {
            $max_line = $line;
            $max = $nl_max;
        }
    } else {
        print $max_line;
        $start = $nl_start;
        $max = $nl_max;
        $max_line = $line;
    }
}

print $max_line;

sub parse_line {
    my $line = shift;
    my ($start, $max) = $line =~ /^([^\t]+\t[^\t]+\t)(\d+)/;
    return ($line, $start, $max);
}
__DATA__
1   100 25  T
1   101 26  A
1   101 27  G
1   101 30  A
1   102 40  A
1   102 40  T

Output:

1       100     25      T
1       101     30      A
1       102     40      A

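You said

If there is a tie for the largest value in the third column, keep the first one (after ordering the file).

This is rather cryptic. Then you ask for output that seems to contradict it, where the last value is printed instead of the first.

I am assuming you meant "keep the first value". If you did mean "keep the last value", simply change the > sign in if ($nl_max > $max) to >=. That will effectively keep the last value instead of the first.

If you are implying some kind of sort, as "after ordering the file" seems to suggest, then I do not have enough information to know what you mean.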
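
For comparison, the same streaming idea can be sketched in Python. This is only a rough sketch, assuming the input is tab-delimited, already sorted so that rows sharing the first two columns are adjacent, and that the third column is an integer. The names trim_sorted and keep_last_on_tie are made up for illustration and do not come from the question.

```python
from itertools import groupby


def trim_sorted(lines, keep_last_on_tie=False):
    """Yield one line per (col1, col2) group: the one with the largest col3.

    Assumes equal keys are adjacent (i.e. the input is sorted) and that
    column 3 holds integers. Only one group is held in memory at a time.
    """
    def key(line):
        # The first two tab-separated fields identify a group.
        return tuple(line.split("\t")[:2])

    for _, group in groupby(lines, key=key):
        best_line = None
        best_score = None
        for line in group:
            score = int(line.split("\t")[2])
            if (
                best_line is None
                or score > best_score
                or (keep_last_on_tie and score == best_score)
            ):
                best_line, best_score = line, score
        yield best_line
```

Because it is a generator, it can be fed an open file object directly and the result written out line by line. With keep_last_on_tie left at False it behaves like the > comparison above; setting it to True corresponds to switching to >=.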

Answer 2

Here is one approach

use strict;
use warnings;
use constant 
    { LINENO => 0
    , LINE   => 1
    , SCORE  => 2
    };
use English qw<$INPUT_LINE_NUMBER>;

my %hash;
while ( <> ) { 
    # split the line to get the fields
    my @fields = split /\t/;
    # Assemble a key for everything except the "score"
    my $key    = join( '-', @fields[0,1] );
    # locally cache the score
    my $score  = $fields[SCORE];

    # if we already have a score for this key, and the current one is not greater, skip the line
    next if $hash{ $key } and $score <= $hash{ $key }[SCORE];
    # store the line number, line text, and score
    $hash{ $key } = [ $INPUT_LINE_NUMBER, $_, $score ]; 
}

# sort by line number and print out the text of the line stored.
foreach my $struct ( sort { $a->[LINENO] <=> $b->[LINENO] } values %hash ) {
    print $struct->[LINE]; 
}

Answer 3

In Python, you could try the following code:

res = {}
with open('c:\\inpt.txt', 'r') as infile:
    for line in infile:
        fields = tuple(line.split())
        if not fields:
            continue  # skip blank lines
        key = fields[:2]
        # Compare the third column numerically; "<=" keeps the last row on a tie.
        if key not in res or int(res[key][0]) <= int(fields[2]):
            res[key] = fields[2:]

with open('c:\\tst.txt', 'w') as f:
    for k, v in res.items():
        f.write('\t'.join(k + v) + '\n')

Answer 4

Also in Python, but cleaner imo

import csv
spamReader = csv.reader(open('eggs'), delimiter='\t')
select = {}
for row in spamReader:
    first_two, three = (row[0], row[1]), row[2]
    if first_two in select:
         if int(select[first_two][2]) > int(three):
             continue
    select[first_two] = row

spamWriter = csv.writer(open('ham', 'w'), delimiter='\t')
for line in select:
    spamWriter.writerow(select[line])

How to trim a file - for rows with the same values in two columns, keep only the row with the max in another column

4 answers: