Python: remove all rows that share a common value in a field

Asked: 2009-07-06 22:46:13

Tags: python duplicate-removal

I have rows of data with 4 fields:

aaaa bbb1 cccc dddd  
aaaa bbb2 cccc dddd  
aaaa bbb3 cccc eeee  
aaaa bbb4 cccc ffff  
aaaa bbb5 cccc gggg  
aaaa bbb6 cccc dddd    

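Please bear with me.

The first and third fields are always the same, but I don't need them. The fourth field may or may not repeat across rows. I only want the 2nd and 4th fields of the rows whose fourth field is not shared with any other row. For example, from the data above: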

bbb3 eeee  
bbb4 ffff    
bbb5 gggg    

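Note that I don't mean deduplication, because that would leave one of the entries behind. If the fourth field of a row shares its value with another row, I don't want any row that has that value.

Apologies once again for asking what is probably a simple question.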

2 answers:

Answer 0: (score: 6)

Here you go:

from collections import defaultdict

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1

# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)

# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
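
The same two-pass counting idea can also be sketched with `collections.Counter`. This is a minimal sketch rather than part of the original answer: the helper name `rows_with_unique_fourth_field` is hypothetical, and it assumes every non-blank line has exactly four whitespace-separated fields.

```python
from collections import Counter

SAMPLE = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd"""


def rows_with_unique_fourth_field(lines):
    """Return (second, fourth) field pairs whose fourth field occurs exactly once.

    Hypothetical helper; assumes each non-blank line has exactly four
    whitespace-separated fields.
    """
    rows = [line.split() for line in lines if line.strip()]
    # First pass: count how often each fourth-field value occurs.
    counts = Counter(row[3] for row in rows)
    # Second pass: keep only the rows whose fourth field was seen once.
    return [(row[1], row[3]) for row in rows if counts[row[3]] == 1]


result = rows_with_unique_fourth_field(SAMPLE.splitlines())
for b, d in result:
    print(b, d)
```

Returning a list of pairs instead of printing inside the loop keeps the original line order and makes the result easy to reuse or test.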

Answer 1: (score: 0)

For the expanded requirement, you can avoid both reading the file twice and holding all of its lines in a list:

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

import collections
adict = collections.defaultdict(list)
for line in LINES: # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)

# Keep only the fourth-field values that were seen on exactly one line.
map_b_to_d = {blist[0]: d for d, blist in adict.items() if len(blist) == 1}
print(map_b_to_d)

# alternative; saves some memory

xdict = {}
duplicated = object()
for line in LINES: # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b

# Drop every fourth-field value that was marked with the sentinel.
map_b_to_d2 = {b: d for d, b in xdict.items() if b is not duplicated}
print(map_b_to_d2)
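
To show how the sentinel variant might be applied to an actual file, here is a minimal sketch that reads a file-like object in a single pass. The function name `unique_fourth_field_pairs` is hypothetical, `io.StringIO` stands in for a real open file, and skipping lines that do not have exactly four fields is an assumption, not something the question specifies.

```python
import io


def unique_fourth_field_pairs(stream):
    """Single pass over an iterable of lines (hypothetical helper).

    Returns (second, fourth) field pairs for fourth-field values that
    appear on exactly one line.
    """
    duplicated = object()  # sentinel marking a value seen more than once
    seen = {}
    for line in stream:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed lines
        _, b, _, d = fields
        # First sighting stores b; any later sighting overwrites it with the sentinel.
        seen[d] = duplicated if d in seen else b
    return [(b, d) for d, b in seen.items() if b is not duplicated]


# io.StringIO stands in for something like open("data.txt").
fake_file = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
pairs = unique_fourth_field_pairs(fake_file)
for b, d in pairs:
    print(b, d)
```

Memory use grows with the number of distinct fourth-field values rather than with the number of lines, which is the saving the alternative above is aiming for.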