Finding duplicate data with Python

Date: 2017-02-22 03:20:06

Tags: python dataframe

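I have a data file containing "name,time,data". I want to find the duplicated data: name and time must match exactly, and data counts as a match if any binary "1" matches, irrespective of position. Is there an existing function that can do this?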

For example:

name,time,data
tg,0x34,1111
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d, 0x34,1111

Duplicates found:

tg,0x34,1111
k,0x34,0100
c,0x34,0001  
d, 0x34,1111 

1 Answer:

Answer 0: (score: 1)

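This looks like a sort, group, compare problem. I'm not quite sure why, but I chose a class that has the methods needed for sorting, grouping, and comparing. ...... Python 2.7 code: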

Setup:

import io, collections, itertools, operator, csv

s = '''name,time,data
tg,0x34,1111
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d,0x34,1111'''

# for file emulation
f = io.BytesIO(s)

A class to hold the information and make use of Python's tools:

class Thing(object):
    def __init__(self, name = None, time = None, data = None):
        self.name = name
        self.time = time
        self.data = data
    def __eq__(self, other):
        'For comparison'
        equal = self.time == other.time
        # equal if there is a one in both things at the same bit position
        equal = equal and bool(int(self.data, base = 2) &
                               int(other.data, base = 2))
        return equal
    def __lt__(self, other):
        'For sorting'
        return self.time < other.time
    def __str__(self):
        return '({}, {}, {})'.format(self.name, self.time, self.data)
    def __repr__(self):
        return '({}, {}, {})'.format(self.name, self.time, self.data)
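
The comparison in `__eq__` comes down to a bitwise AND of the two data fields parsed as base-2 integers. A minimal sketch of just that test, written for Python 3; the helper name `shares_set_bit` is made up here for illustration and is not part of the answer:

```python
def shares_set_bit(data_a, data_b):
    """Return True if both bit strings have a 1 at the same position."""
    # int(..., 2) parses a string such as '0100' as binary;
    # & keeps only the bits that are set in both values.
    return bool(int(data_a, 2) & int(data_b, 2))

print(shares_set_bit('1111', '0100'))  # True
print(shares_set_bit('0100', '0001'))  # False
print(shares_set_bit('0000', '1111'))  # False
```

Under this rule `0100` and `0001` do not match each other even though each of them matches `1111`, so the relation is not transitive.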

Use the csv module to make a list of Things and sort it (by time):

reader = csv.DictReader(f)
things = [Thing(**row) for row in reader]
things.sort()

Use itertools.groupby and itertools.combinations to compare Things that share the same time. Collect the Things in a group that compare equal.

results = set()
for key, group in itertools.groupby(things, key = operator.attrgetter('time')):
    print key
    for a, b in itertools.combinations(group, 2):
        if a == b:
            print '\t{} is duplicate of {}'.format(a, b)
            results.add(a)
            results.add(b)

This results in:

>>> 
0x34
    (tg, 0x34, 1111) is duplicate of (k, 0x34, 0100)
    (tg, 0x34, 1111) is duplicate of (c, 0x34, 0001)
    (tg, 0x34, 1111) is duplicate of (d, 0x34, 1111)
    (k, 0x34, 0100) is duplicate of (d, 0x34, 1111)
    (c, 0x34, 0001) is duplicate of (d, 0x34, 1111)
0x54
>>> results
set([(tg, 0x34, 1111), (c, 0x34, 0001), (d, 0x34, 1111), (k, 0x34, 0100)])
>>>
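
The code above targets Python 2.7. Under Python 3 a few details change: `print` is a function, the csv module expects text so `io.StringIO` takes the place of `io.BytesIO`, and a class that defines `__eq__` without `__hash__` becomes unhashable, so its instances cannot be added to a set. A rough Python 3 sketch of the same sort, group, compare idea follows. The `Record` namedtuple, the `overlaps` helper, and the `find_duplicates` function are names chosen here, not taken from the answer, and `skipinitialspace=True` is an added assumption to tolerate rows like `d, 0x34,1111` from the question.

```python
import collections
import csv
import io
import itertools
import operator

# Plain immutable record; hashable by default, so it can live in a set.
Record = collections.namedtuple('Record', 'name time data')

def overlaps(a, b):
    """True if both records have a 1 at the same bit position."""
    return bool(int(a.data, 2) & int(b.data, 2))

def find_duplicates(lines):
    """Return the set of records that overlap another record with the same time."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    by_time = operator.attrgetter('time')
    # groupby only merges adjacent items, so sort by the same key first
    records = sorted((Record(**row) for row in reader), key=by_time)
    found = set()
    for _, group in itertools.groupby(records, key=by_time):
        for a, b in itertools.combinations(group, 2):
            if overlaps(a, b):
                found.update((a, b))
    return found

s = '''name,time,data
tg,0x34,1111
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d,0x34,1111'''

dupes = find_duplicates(io.StringIO(s))
print(sorted(r.name for r in dupes))  # ['c', 'd', 'k', 'tg']
```

Keeping the overlap test in a separate function, rather than in `__eq__`, avoids having to define a `__hash__` that is consistent with a non-transitive notion of equality.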

I'm not sure I understood the spec - the following dataset produces zero duplicates:

s = '''name,time,data
tg,0x34,0010
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d,0x34,1000'''
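
If "irrespective of position" in the question means that a 1 anywhere in the data field counts, then a row qualifies whenever its data is non-zero and at least one other row with the same time also has non-zero data. That reading is an assumption, not something the question confirms. A Python 3 sketch under that assumption (the function name `any_bit_duplicates` is hypothetical):

```python
import collections
import csv
import io

def any_bit_duplicates(lines):
    """Group names by time, keeping rows whose data contains at least one '1'."""
    by_time = collections.defaultdict(list)
    for row in csv.DictReader(lines, skipinitialspace=True):
        if '1' in row['data']:
            by_time[row['time']].append(row['name'])
    # A time with a single qualifying row has nothing to be a duplicate of.
    return {t: names for t, names in by_time.items() if len(names) > 1}

s = '''name,time,data
tg,0x34,0010
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d,0x34,1000'''

print(any_bit_duplicates(io.StringIO(s)))
# {'0x34': ['tg', 'k', 'c', 'd']}
```

With this interpretation the dataset above no longer comes out empty, because the position of each 1 is ignored.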

You may want to keep the duplicates for different times separate, so store them in a dictionary with the time as the key.

results = collections.defaultdict(set)
for key, group in itertools.groupby(things, key = operator.attrgetter('time')):
    print key
    for a, b in itertools.combinations(group, 2):
        if a == b:
            print '\t{} is duplicate of {}'.format(a, b)
            results[key].update((a,b))
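
Since the results end up keyed by time anyway, the sort and `itertools.groupby` step could be swapped for bucketing the rows into a dictionary first. This is a variation, not what the answer does. A Python 3 sketch with rows held as plain `(name, bits)` tuples; the function name `duplicates_by_time` is made up for illustration:

```python
import collections
import csv
import io
import itertools

def duplicates_by_time(lines):
    """Map each time to the set of names whose data overlaps another row's."""
    buckets = collections.defaultdict(list)
    for row in csv.DictReader(lines, skipinitialspace=True):
        # store the data field already parsed as a base-2 integer
        buckets[row['time']].append((row['name'], int(row['data'], 2)))
    results = collections.defaultdict(set)
    for time, rows in buckets.items():
        for (name_a, bits_a), (name_b, bits_b) in itertools.combinations(rows, 2):
            if bits_a & bits_b:  # a 1 at the same position in both rows
                results[time].update((name_a, name_b))
    return dict(results)

s = '''name,time,data
tg,0x34,1111
ab,0x54,1011
k,0x34,0100
c,0x34,0001
e,0x34,0000
d,0x34,1111'''

result = duplicates_by_time(io.StringIO(s))
print({t: sorted(names) for t, names in result.items()})
# {'0x34': ['c', 'd', 'k', 'tg']}
```

Times without any overlapping pair, such as `0x54` here, never get an entry, so the returned dictionary only lists times that actually have duplicates.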