How to remove duplicate entries in a file

Time: 2015-01-14 08:19:48

Tags: python perl file

I have a file (input.txt) that contains the following lines:

1_306500682 2_315577060 3_315161284 22_315577259 22_315576763 

2_315578866 2_315579020 3_315163106 1_306500983 

2_315579517 3_315162181 1_306502338 2_315578919 

1_306500655 2_315579567 3_315161256 3_315161708 

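Within each line, when several entries share the same value before the _, I want to keep only the first of them. For the example above, output.txt should contain: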

1_306500682 2_315577060 3_315161284 22_315577259 

2_315578866 3_315163106 1_306500983 

2_315579517 3_315162181 1_306502338 

1_306500655 2_315579567 3_315161256 

Please help.

3 answers:

Answer 0 (score: 2)

Perl, from the command line:

perl -lane 'my %s;print join " ", grep /^(\d+)_/ && !$s{$1}++, @F' file

Output:

1_306500682 2_315577060 3_315161284 22_315577259

2_315578866 3_315163106 1_306500983

2_315579517 3_315162181 1_306502338

1_306500655 2_315579567 3_315161256
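
For readers less familiar with Perl, the same idea can be sketched in Python. This is a rough equivalent, not a literal translation of the one-liner, and the names dedupe_line and PREFIX_RE are made up for illustration. Like the grep in the Perl version, it keeps an entry only if it starts with digits followed by _ and that numeric prefix has not been seen earlier on the same line; a fresh set is used for every line.

```python
import re

# Same pattern as the Perl one-liner: digits, then an underscore, at the start.
PREFIX_RE = re.compile(r"^(\d+)_")


def dedupe_line(line):
    """Keep the first entry for each numeric prefix on one line (hypothetical helper)."""
    seen = set()
    kept = []
    for entry in line.split():
        match = PREFIX_RE.match(entry)
        # Entries without a numeric prefix are dropped, as grep does in the Perl version.
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            kept.append(entry)
    return " ".join(kept)


print(dedupe_line("1_306500682 2_315577060 3_315161284 22_315577259 22_315576763"))
# prints: 1_306500682 2_315577060 3_315161284 22_315577259
```

Note that comparing the prefixes as strings means 2 and 22 are treated as different prefixes, which matches the expected output in the question.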

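Answer 1 (score: 0)

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        seen = set()  # prefixes already written for this line
        nums = line.split()
        for num in nums:
            header = num.split("_")[0]  # the part before the first underscore
            if header not in seen:
                # first entry with this prefix: keep it
                outfile.write(num)
                outfile.write(" ")
            seen.add(header)
        # each output line ends with a trailing space before the newline
        outfile.write('\n')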

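Answer 2 (score: 0)

You can use a separate set to keep track of the word prefixes encountered so far, and collect the non-duplicate words of each line into a list. Once a line has been processed this way, it is easy to build a replacement line of text that contains only the non-duplicate entries found. Note: this is just a slightly more efficient version of inspectorG4dget's current answer.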

with open('input.txt', 'rt') as infile, \
     open('non_repetitive_input.txt', 'wt') as outfile:
    for line in infile:
        values, prefixes = [], set()
        for word, prefix in ((entry, entry.partition('_')[0])
                                for entry in line.split()):
            if prefix not in prefixes:
                values.append(word)
                prefixes.add(prefix)
        outfile.write(' '.join(values) + '\n')

Contents of the output file:

1_306500682 2_315577060 3_315161284 22_315577259
2_315578866 3_315163106 1_306500983
2_315579517 3_315162181 1_306502338
1_306500655 2_315579567 3_315161256
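
To check the approach without touching any files, the per-line logic can be pulled into a small function and run against the sample data from the question. This is only a sketch: the function name first_per_prefix and the sample and expected lists are introduced here for illustration, and the lists are copied from the question's input and desired output.

```python
def first_per_prefix(line):
    """Return the entries of `line` whose prefix (text before '_') is new on that line."""
    values, prefixes = [], set()
    for entry in line.split():
        prefix = entry.partition("_")[0]
        if prefix not in prefixes:
            values.append(entry)
            prefixes.add(prefix)
    return values


# Sample input lines from the question.
sample = [
    "1_306500682 2_315577060 3_315161284 22_315577259 22_315576763",
    "2_315578866 2_315579020 3_315163106 1_306500983",
    "2_315579517 3_315162181 1_306502338 2_315578919",
    "1_306500655 2_315579567 3_315161256 3_315161708",
]

# Desired output lines from the question.
expected = [
    "1_306500682 2_315577060 3_315161284 22_315577259",
    "2_315578866 3_315163106 1_306500983",
    "2_315579517 3_315162181 1_306502338",
    "1_306500655 2_315579567 3_315161256",
]

result = [" ".join(first_per_prefix(line)) for line in sample]
assert result == expected
print("\n".join(result))
```

Joining the kept entries with a single space also avoids leaving a trailing space at the end of each output line.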