I have lines of data with 4 fields:
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd
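Bear with me.
The first and third fields are always the same, but I don't need them. The fourth field may or may not repeat across lines. What I want is only the 2nd and 4th fields of the lines whose 4th field is not shared with any other line. For example, from the data above: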
bbb3 eeee
bbb4 ffff
bbb5 gggg
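Note that I don't mean de-duplication, because that would leave one of the entries behind. If the 4th field shares a value with another line, I don't want any of the lines that have that value.
Humblest apologies for once again asking what is probably simple.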
Answer 0 (score: 6)
Here you go:
from collections import defaultdict
LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')
# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1
# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)
# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
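The same two-pass idea can also be sketched with `collections.Counter`, wrapped in a function so it is easy to reuse. The function name `rows_with_unique_fourth_field` is made up for illustration, and the sketch assumes every non-blank line has exactly four whitespace-separated fields:

```python
from collections import Counter


def rows_with_unique_fourth_field(lines):
    """Return (second, fourth) field pairs for lines whose fourth
    field value appears on exactly one line."""
    # Assumption: every non-blank line has four whitespace-separated fields.
    rows = [line.split() for line in lines if line.strip()]
    # First pass: count how often each fourth-field value occurs.
    counts = Counter(row[3] for row in rows)
    # Second pass: keep only rows whose fourth field occurs once.
    return [(row[1], row[3]) for row in rows if counts[row[3]] == 1]


sample = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".splitlines()

for b, d in rows_with_unique_fourth_field(sample):
    print(b, d)
```

This keeps the original line order, at the cost of holding all the split rows in memory.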
Answer 1 (score: 0)
For the expanded requirement, you can avoid both reading the file twice and keeping it in a list:
LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')
import collections
adict = collections.defaultdict(list)
for line in LINES:  # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)
map_b_to_d = dict((blist[0], d) for d, blist in adict.items() if len(blist) == 1)
print(map_b_to_d)
# alternative; saves some memory
xdict = {}
duplicated = object()
for line in LINES:  # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b
map_b_to_d2 = dict((b, d) for d, b in xdict.items() if b is not duplicated)
print(map_b_to_d2)
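To show the "or file" case, here is a rough sketch of the sentinel approach as a function that accepts any iterable of lines, so an open file object can be passed in directly and is read only once. The name `unique_fourth_field_pairs` is hypothetical, `io.StringIO` only stands in for a real file, and skipping lines that do not have exactly four fields is my own assumption, not part of the question:

```python
import io

# Sentinel marking a fourth-field value that has been seen more than once.
_DUPLICATED = object()


def unique_fourth_field_pairs(lines):
    """Single pass over `lines`; return (second, fourth) field pairs
    whose fourth field value occurs on exactly one line."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: ignore blank or malformed lines
        _, b, _, d = fields
        # First sighting stores b; any later sighting overwrites it
        # with the sentinel, so no list of b values is kept.
        seen[d] = _DUPLICATED if d in seen else b
    return [(b, d) for d, b in seen.items() if b is not _DUPLICATED]


# io.StringIO stands in for open("data.txt"); the file name is made up.
fake_file = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
for b, d in unique_fourth_field_pairs(fake_file):
    print(b, d)
```

Memory use grows with the number of distinct fourth-field values rather than with the number of lines. The output follows first-seen order of the fourth field, which relies on dicts preserving insertion order (Python 3.7+).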