How to remove duplicate elements based on date

Time: 2017-03-30 07:25:19

Tags: python algorithm list data-structures

I have a dictionary that contains a list of lists:

find_dup = {"one":[["1654","raj","425","16-02-2017"],["1654","mo","426","20-02-2017"],["1654","ss","425","20-02-2017"],["1654","vs","427","20-02-2017"],["1654","ss","425","14-02-2017"]]}

I want to find duplicates based on the first and third elements of each list.

For example:

["1654","raj","425","16-02-2017"] -> 1654,425
["1654","mo","426","20-02-2017"] -> 1654,426
["1654","ss","425","20-02-2017"] -> 1654,425
["1654","vs","427","20-02-2017"] -> 1654,427
["1654","ss","425","14-02-2017"] -> 1654,425

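From the elements above you can see that 1654,425 is duplicated (because I want to find duplicates based on the first and third elements).

So, from the list above, these lists are duplicates: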

["1654","raj","425","16-02-2017"] -> 1654,425
["1654","ss","425","20-02-2017"] -> 1654,425
["1654","ss","425","14-02-2017"] -> 1654,425

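Now, from these lists we have to remove the 2 elements with the older dates (the last element of each list is the date).

These 2 lists have older dates, so they should be removed: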

["1654","raj","425","16-02-2017"] -> 1654,425
["1654","ss","425","14-02-2017"] -> 1654,425

The result should be this:

find_dup = {"one":[["1654","mo","426","20-02-2017"],["1654","ss","425","20-02-2017"],["1654","vs","427","20-02-2017"]]}

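I have a Python script that iterates over the list, but I can't work out the logic for removing an element when a duplicate is found and keeping only the one with the latest date.

Here is my failing script: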

find_dup = {"one":[["1654","raj","425","16-02-2017"],["1654","mo","426","20-02-2017"],["1654","ss","425","20-02-2017"],["1654","vs","427","20-02-2017"],["1654","ss","425","14-02-2017"]]}


for d in find_dup:
    len_d = len(find_dup[d])
    store_array_dup = []
    store_array_ele = {}
    for i in find_dup[d]:

        val = i[0]+"-"+i[1]+"-"+i[2]+"-"+i[3]
        val_1 = i[0]+"-"+i[2]
        if val_1 in store_array_dup:
            store_array_ele.append(val_1)
        else:
            arrs = []
            arrs.append(val)
            store_array_ele[d] = arrs

How can I produce this result?

find_dup = {"one":[["1654","mo","426","20-02-2017"],["1654","ss","425","20-02-2017"],["1654","vs","427","20-02-2017"]]}

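3 answers:

Answer 0: (score: 1)

I suggest sorting the list by the tuple (first element, third element, date) so that the most recent date comes first within each group, then grouping the sorted list by the first and third elements, and finally picking the first element from each subgroup.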
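
The sort-then-group idea described above could be sketched roughly as follows. This is only an illustration, not the answerer's original code: the helper names `parse_date` and `keep_latest` are made up for this example, and it assumes every date is a valid DD-MM-YYYY string.

```python
from datetime import datetime
from itertools import groupby

def parse_date(text):
    # Assumption: dates are always formatted as DD-MM-YYYY, as in the question.
    return datetime.strptime(text, "%d-%m-%Y")

def keep_latest(rows):
    # Sort by (first element, third element), newest date first within each key.
    # Negating the date's ordinal puts later dates earlier in a single sort.
    ordered = sorted(
        rows, key=lambda r: (r[0], r[2], -parse_date(r[3]).toordinal())
    )
    # groupby only merges adjacent rows, which is why the sort has to come first.
    grouped = groupby(ordered, key=lambda r: (r[0], r[2]))
    # The first row of each group is the newest one for that key.
    return [next(group) for _, group in grouped]

find_dup = {"one": [["1654", "raj", "425", "16-02-2017"],
                    ["1654", "mo", "426", "20-02-2017"],
                    ["1654", "ss", "425", "20-02-2017"],
                    ["1654", "vs", "427", "20-02-2017"],
                    ["1654", "ss", "425", "14-02-2017"]]}

result = {name: keep_latest(rows) for name, rows in find_dup.items()}
print(result)
```

Note that the surviving rows come back ordered by key rather than in their original positions.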

Answer 1: (score: 0)

Here is your dataset:

find_dup = {"one":[
                      ["1654","raj","425","16-02-2017"],
                      ["1654","mo","426","20-02-2017"],
                      ["1654","ss","425","20-02-2017"],
                      ["1654","vs","427","20-02-2017"],
                      ["1654","ss","425","14-02-2017"]
                   ]
            }

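You can sort the dataset by date and then build a new dict keyed on the first and third elements. Because later items overwrite earlier ones that share a key, the most recent entry for each key is the one that remains: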

from datetime import datetime
lst = sorted(find_dup['one'] , key=lambda x: datetime.strptime(x[3], "%d-%m-%Y"))

new_dict = {(item[0], item[2]): item for item in lst}

print(new_dict)

Output:

>>> print(new_dict.values())
[['1654', 'vs', '427', '20-02-2017'], ['1654', 'mo', '426', '20-02-2017'], ['1654', 'ss', '425', '20-02-2017']]
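
The snippet above handles only the 'one' key. A possible way to apply the same sort-then-overwrite trick to every key of the dictionary, and to get plain lists back on Python 3, might look like this. The function name `latest_per_key` is hypothetical, and the sketch assumes all dates are valid DD-MM-YYYY strings.

```python
from datetime import datetime

def latest_per_key(rows):
    # Oldest first, so the newest row for a key is written last and wins.
    by_date = sorted(rows, key=lambda r: datetime.strptime(r[3], "%d-%m-%Y"))
    newest = {(r[0], r[2]): r for r in by_date}
    # dict.values() is a view on Python 3, so convert it to a list.
    return list(newest.values())

find_dup = {"one": [["1654", "raj", "425", "16-02-2017"],
                    ["1654", "mo", "426", "20-02-2017"],
                    ["1654", "ss", "425", "20-02-2017"],
                    ["1654", "vs", "427", "20-02-2017"],
                    ["1654", "ss", "425", "14-02-2017"]]}

deduped = {name: latest_per_key(rows) for name, rows in find_dup.items()}
print(deduped)
```

Since `sorted` is stable, if two rows share both key and date, the one that appears later in the input is the one kept.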

Answer 2: (score: 0)

First solve the problem for a single list of lists:

def mounarajan_no_dup(l):
    dedup = {}
    for i in l:
        # Key on the first and third elements, which define a duplicate
        k = (i[0], i[2])
        if k not in dedup:
            dedup[k] = i
        else:
            # Rearrange DD-MM-YYYY into YYYYMMDD so string comparison orders by date
            j3 = dedup[k][3]
            jdate = j3[6:10] + j3[3:5] + j3[0:2]
            i3 = i[3]
            idate = i3[6:10] + i3[3:5] + i3[0:2]
            # Keep the entry with the more recent date
            if jdate < idate:
                dedup[k] = i
    return list(dedup.values())

Then apply it to each entry of find_dup.

find_dup = {
    "one":[
        ["1654","raj","425","16-02-2017"],
        ["1654","mo","426","20-02-2017"],
        ["1654","ss","425","20-02-2017"],
        ["1654","vs","427","20-02-2017"],
        ["1654","ss","425","14-02-2017"]]}
for d in find_dup:
    find_dup[d] = mounarajan_no_dup(find_dup[d])
find_dup
{'one': [['1654', 'ss', '425', '20-02-2017'], ['1654', 'mo', '426', '20-02-2017'], ['1654', 'vs', '427', '20-02-2017']]}
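
If the surviving rows should stay in their original positions, as in the expected result shown in the question, one option is to record the index of the newest row per key and then filter the input. This is a sketch under the assumption that all dates are valid DD-MM-YYYY strings; `dedup_keep_order` is a name invented for this example.

```python
from datetime import datetime

def dedup_keep_order(rows):
    newest = {}  # (first, third) -> (parsed date, index of the newest row so far)
    for index, row in enumerate(rows):
        key = (row[0], row[2])
        when = datetime.strptime(row[3], "%d-%m-%Y")
        # Replace only on a strictly later date, so the first of any tie is kept.
        if key not in newest or when > newest[key][0]:
            newest[key] = (when, index)
    keep = {index for _, index in newest.values()}
    # Filter the original list so the kept rows retain their relative order.
    return [row for index, row in enumerate(rows) if index in keep]

find_dup = {"one": [["1654", "raj", "425", "16-02-2017"],
                    ["1654", "mo", "426", "20-02-2017"],
                    ["1654", "ss", "425", "20-02-2017"],
                    ["1654", "vs", "427", "20-02-2017"],
                    ["1654", "ss", "425", "14-02-2017"]]}

ordered_result = {name: dedup_keep_order(rows) for name, rows in find_dup.items()}
print(ordered_result)
```

This makes a single pass over the data and avoids sorting altogether, at the cost of one extra dictionary.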