Merge columns and remove duplicates

Time: 2014-11-05 14:43:23

Tags: python file merge duplication

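I have an input file with two columns of data. I need to merge the two columns into one and remove the duplicates. Any suggestions on how to get started? Thanks!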

Input file

5045 2317
5045 1670
5045 2156
5045 1509
5045 3833
5045 1013
5045 3491
5045 32
5045 1482
5045 2495
5045 4280
5045 1380
5045 3998

Expected output

 5045 
 2317
 1670
 2156
 1509
 3833
 1013
 3491
 32
 1482
 2495
 4280
 1380
 3998

4 answers:

Answer 0 (score: 1)

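set1 = set()
set2 = set()
# "file.txt" is an assumed name; use the path of your input file
with open("file.txt") as myfile:
    for line in myfile:
        # each line holds two whitespace-separated integers
        a, b = line.strip().split()
        set1.add(int(a))
        set2.add(int(b))
# merge the second column's values into the first set
set1.update(set2)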

Then write the contents of set1 to a file. Note that a set does not preserve the order of the input.
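The write step is not shown in the answer; it could look roughly like the sketch below. The helper name write_values and the output path are hypothetical, and sorting is an added assumption that makes the output reproducible, since a set has no defined order.

```python
import os
import tempfile


def write_values(values, path):
    """Write each integer on its own line, in ascending order."""
    with open(path, "w") as out:
        for value in sorted(values):
            out.write("%d\n" % value)


# Stand-in for the merged set1 built by the loop above.
set1 = {5045, 2317, 1670, 32}

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "output.txt")
write_values(set1, out_path)
```

If the original order of the input has to be kept, a set is the wrong container and one of the order-preserving answers is a better fit.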

Answer 1 (score: 0)

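I assume the order of the lines in the output matters. The output of the code below matches your expected output exactly (unlike, for example, the answers that use sets):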

In [1]: with open("file.txt") as f, open("output.txt", "w") as out:
   ...:     arrs = [ l.rstrip().split() for l in f ] 
   ...:     vals = [ a for arr in arrs for a in arr ] # merge columns
   ...:     # restrict to first occurrence of each value (i.e. remove duplicates)
   ...:     uniqueVals = [ v for i, v in enumerate(vals) if vals.index(v) == i ]
   ...:     out.write("\n".join(uniqueVals))

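This reads its input from "file.txt", writes the result to "output.txt", and works in three steps:

  1. Load the input file.
  2. Merge the two columns.
  3. Keep only the first occurrence of each value.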
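For larger files, the same three steps can be sketched with a different technique: dict.fromkeys in place of the vals.index lookup. This runs in roughly linear time, whereas calling index once per value is quadratic. The function name merge_unique and the sample data are hypothetical, and the sketch relies on dictionaries keeping insertion order, which Python guarantees from version 3.7.

```python
def merge_unique(lines):
    """Flatten whitespace-separated columns and keep first occurrences."""
    # Merge the columns: every token from every line, in reading order.
    values = [value for line in lines for value in line.split()]
    # Dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops later duplicates without reordering.
    return list(dict.fromkeys(values))


# A few rows from the question's input, used as sample data.
sample = ["5045 2317\n", "5045 1670\n", "5045 2156\n"]
print("\n".join(merge_unique(sample)))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.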

Answer 2 (score: 0)

>>> import numpy as np
>>> a=np.loadtxt('file_name',delimiter=' ')
>>> a=a.flatten()
>>> a=list(set(a))
>>> a
[32.0, 3491.0, 1380.0, 1509.0, 1670.0, 1482.0, 2156.0, 2317.0, 5045.0, 4280.0, 3833.0, 2495.0, 3998.0, 1013.0]

Answer 3 (score: 0)

To keep the order:

from itertools import chain
with open("in.txt") as f:
    lines = list(chain.from_iterable(x.split() for x in f))
    with open("in.txt","w") as f1:
        for ind, line in enumerate(lines,1):
            if line not in lines[:ind-1]:
                f1.write(line+"\n")

Output:

5045
2317
1670
2156
1509
3833
1013
3491
32
1482
2495
4280
1380
3998

If the order does not matter:

from itertools import chain
with open("in.txt") as f:
    lines = set(chain.from_iterable(x.split() for x in f))
    with open("in.txt","w") as f1:
        f1.write("\n".join(lines))

If only a single number is repeated in the first column:

with open("in.txt") as f:
    col_1 = next(f).split()[0] # get first column number; next(f) works in Python 2.6+ and 3
    lines = set(x.split()[1] for x in f) # get all second column nums
    lines.add(col_1) # add first column num
    with open("in.txt","w") as f1:
        f1.write("\n".join(lines))