Python - comparing JSON using sets only works occasionally

Time: 2016-08-14 15:31:19

Tags: python json set

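I have a very strange problem on my hands. I don't have much experience with Python (my language of choice is Swift, for mobile development), but what I have to do for this project is pull some csv files from a database, download them locally, and upload them to Amazon's DynamoDB.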

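I managed to get everything working: the program downloads the csv files as a zip archive, extracts them using zipfile, converts the csv files to json files, and then starts uploading the json to DynamoDB.

However, these csv files contain roughly 100,000 rows each, and it makes no sense to re-upload every item when only 5-10 items change in a csv file each time. So what I decided to do is have the program compare the new json with the old json before uploading to DynamoDB, pick out only the new items, and upload those.

Now, on to the actual problem. What I have been trying is: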

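import json

# Raw strings keep the backslashes literal ("\n" in "\newfile" would otherwise be a newline).
with open(r"C:\Users\Me\Desktop\staff\oldfile.json") as json1:
    list1 = json.load(json1)
with open(r"C:\Users\Me\Desktop\staff\newfile.json") as json2:
    list2 = json.load(json2)

# Compare the records by their string representation.
set_1 = set(repr(x) for x in list1)
set_2 = set(repr(x) for x in list2)

# Items that appear only in the new file.
differences = set_2 - set_1
print(differences)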

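This actually works quite well. If the sets are identical, the result is set(); otherwise it contains only the newly added items.

However,

I noticed that when I convert the csv files to json, the order of the keys within the same object changes between the two files. For example, in the first json file the object might be: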

[{"name": "jack", "id": "3100", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]

The same object in the new json file might be:

[{"id": "3100", "name": "jack", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]

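These are both the same object, aren't they?

But when I try to compare the two with Python, sometimes I get set() and sometimes it tells me it is a new object. What is going on?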


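Honestly, I have been troubleshooting this for almost a full day and I am just about at my wits' end. I really can't understand why it works on one run and not on the next with exactly the same json objects. Any help would be greatly appreciated!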

1 Answer:

Answer 0 (score: 2)

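Your code relies on the ordering of dictionaries. Dictionary order depends on insertion and deletion history and, because of hash randomization, varies between Python interpreter runs; it should not be relied upon.

If your dictionaries are not nested, you can store each one in the set as a sorted tuple of its key-value pairs: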
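To make the failure mode concrete, here is a minimal sketch using two hypothetical records trimmed down from the question's data. It assumes Python 3.7 or later, where dicts keep insertion order, so it should print True and then False on every run; on the older interpreters the answer describes, the second result could change from one run to the next.

```python
# Two dicts with identical key-value pairs, inserted in a different order
# (hypothetical sample data, trimmed from the question's records).
a = {"name": "jack", "id": "3100"}
b = {"id": "3100", "name": "jack"}

# Dict equality compares contents and ignores key order.
print(a == b)

# repr() renders the keys in iteration order, so the strings differ
# whenever that order differs, even though the dicts are equal.
print(repr(a) == repr(b))
```

Since the sets in the question hold repr() strings rather than the dictionaries themselves, two equal records can end up as two different set members.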

set_1 = set(tuple(sorted(x.items())) for x in list1)
set_2 = set(tuple(sorted(x.items())) for x in list2)

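This creates an immutable representation that preserves the original key-value pairs but avoids any ordering issues. The tuples can simply be passed back to the dict() type to recreate the dictionaries.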
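The whole round trip might look like the sketch below. The sample records and the helper name as_key are made up for illustration; in the real program list1 and list2 would come from json.load(). It assumes flat dictionaries whose keys are strings and whose values are hashable.

```python
# Hypothetical sample records; the real lists would come from json.load().
list1 = [{"name": "jack", "id": "3100"}]
list2 = [
    {"id": "3100", "name": "jack"},  # same record, keys in another order
    {"id": "3101", "name": "jill"},  # genuinely new record
]


def as_key(record):
    """Return an order-independent, hashable form of a flat dict."""
    return tuple(sorted(record.items()))


set_1 = {as_key(x) for x in list1}
set_2 = {as_key(x) for x in list2}

# Items present only in the new file, converted back to dictionaries.
new_items = [dict(pairs) for pairs in set_2 - set_1]
print(new_items)
```

Only the "jill" record should end up in new_items, because the reordered "jack" record maps to the same tuple in both sets. For nested dictionaries this tuple form would not be hashable; one possible alternative, which is not part of the answer above, is to build the set from json.dumps(x, sort_keys=True) strings instead.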