Question

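I am processing some fairly complex datasets with Python, and I am new to Python programming. Each dataset is a collection of dates, titles, content, and URLs.

Conceptually, it looks like this.

The first scraping run gives me: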

[9/6 9:00, title1, content1]
[9/6 9:00, title2, content2]
[9/6 8:22, title3, content3]
[9/6 11:01, title4, content4]
...

The second scraping run gives me:

[9/6 13:05, title5, content5]
[9/6 12:13, title6, content6]
[9/6 9:00, title1, content1]
[9/6 14:21, title4, content4'] ---> this is an updated version of content4
...

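I can run the scraping code. What I want to do is compare the output of the first scraping run with the output of the second, and show only the differences.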

[9/6 13:05, title5, content5]
[9/6 12:13, title6, content6]
[9/6 10:21, title4', content4']

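I don't think I need to compare "content". The difference can be found from "date" and "title" alone.

I have spent several hours on this but cannot come up with an elegant way to make it work. What is the best approach here? Basically, I want to store the output as a pickle and then compare it with the output of the second scraping run. However, I am not sure how to take two elements of each list at once and compare them with the corresponding two elements of the lists in the second run. A loop does not seem straightforward...

Or could this be done with a dict? I don't think so... but any suggestions are welcome.

I would be very grateful for comments from anyone with experience.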

Answer 1

Try this to compare the lists in Python 3:

a = [['9/6 9:00', 'title1', 'content1'],
     ['9/6 9:00', 'title2', 'content2'],
     ['9/6 8:22', 'title3', 'content3'],
     ['9/6 11:01', 'title4', 'content4']]
b = [['9/6 13:05', 'title5', 'content5'],
     ['9/6 12:13', 'title6', 'content6'],
     ['9/6 9:00', 'title1', 'content1'],
     ['9/6 14:21', 'title4', 'content4']]
# print every row of b that has no exact match in a
for i in b:
    if i not in a:
        print(i)

Output:

['9/6 13:05', 'title5', 'content5']
['9/6 12:13', 'title6', 'content6']
['9/6 14:21', 'title4', 'content4']

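Here each inner list is compared as a whole, for example ['9/6 11:01','title4', 'content4'] against ['9/6 14:21', 'title4', 'content4'], so an inner list is printed if any single element in it differs. If you need to compare only particular elements of one list against particular elements of another, you will have to use a different approach.

Alternative (the same thing written as a list comprehension):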
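For the case the question describes, where only "date" and "title" should decide whether a row is new, one possible approach is to build a set of (date, title) pairs from the first run and filter the second run against it. This is only a sketch: the names first_run, second_run and row_key are made up for illustration, and it assumes the date is always at index 0 and the title at index 1.

```python
first_run = [['9/6 9:00', 'title1', 'content1'],
             ['9/6 9:00', 'title2', 'content2'],
             ['9/6 8:22', 'title3', 'content3'],
             ['9/6 11:01', 'title4', 'content4']]
second_run = [['9/6 13:05', 'title5', 'content5'],
              ['9/6 12:13', 'title6', 'content6'],
              ['9/6 9:00', 'title1', 'content1'],
              ['9/6 14:21', 'title4', "content4'"]]


def row_key(row):
    """Hypothetical helper: identify a row by its (date, title) pair only."""
    return (row[0], row[1])


# Tuples are hashable, so the keys of the first run can live in a set.
seen = {row_key(row) for row in first_run}

# Keep the rows of the second run whose (date, title) was not seen before.
new_or_updated = [row for row in second_run if row_key(row) not in seen]

for row in new_or_updated:
    print(row)
```

Because set lookups do not scan the whole first run for every row, this should also scale better than `i not in a` on large scrapes. The content column is never looked at, so a row whose content changed but whose date and title stayed the same would not be reported.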

print(*[i for i in b if i not in a],sep='\n')

It gives the same output:

['9/6 13:05', 'title5', 'content5']
['9/6 12:13', 'title6', 'content6']
['9/6 14:21', 'title4', 'content4']

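Here the list comprehension itself is just [i for i in b if i not in a]; the sep='\n' argument makes print put each element on its own line. To understand list comprehensions, see this article: Python List Comprehensions: Explained Visually

If you tell us exactly which differences should be printed, I can help further, because from the question I don't understand how 9/6 10:21 ends up in the output line [9/6 10:21, title4', content4'].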
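To make the equivalence concrete, here is a small sketch (with a shortened version of the lists, and variable names chosen only for illustration) that builds the result once with an explicit loop and once with a comprehension, and shows what print(*rows, sep='\n') amounts to.

```python
a = [['9/6 9:00', 'title1', 'content1'],
     ['9/6 11:01', 'title4', 'content4']]
b = [['9/6 13:05', 'title5', 'content5'],
     ['9/6 9:00', 'title1', 'content1'],
     ['9/6 14:21', 'title4', 'content4']]

# Explicit loop: collect every row of b that is not in a.
diff_loop = []
for row in b:
    if row not in a:
        diff_loop.append(row)

# The same filter written as a list comprehension.
diff_comp = [row for row in b if row not in a]

# print(*diff_comp, sep='\n') passes each row as a separate argument and
# separates them with newlines; joining the string forms gives the same text.
joined = '\n'.join(str(row) for row in diff_comp)
print(joined)
```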

Answer 2

Have you tried something like this?

>>> common_elements = []
>>> a = [['date', 'title1', 'content1'], ['date2', 'title2', 'content2']]
>>> b = [['date3', 'title3', 'content3'], ['date2', 'title2', 'content2']]
>>> for element in a:
...     if element in b:
...         common_elements.append(element)
... 
>>> common_elements
[['date2', 'title2', 'content2']]

Answer 3

a = [['9/6 9:00', 'title1', 'content1'],
     ['9/6 9:00', 'title2', 'content2'],
     ['9/6 8:22', 'title3', 'content3'],
     ['9/6 11:01','title4', 'content4']]
b = [['9/6 13:05', 'title5', 'content5'],
     ['9/6 12:13', 'title6', 'content6'],
     ['9/6 9:00', 'title1', 'content1'],
     ['9/6 14:21', 'title4', 'content4']]

[i for i in b if i not in a]

You can also use a generator expression.
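As a rough illustration of the generator variant (the names changed, first and rest are just for this sketch), swapping the square brackets for parentheses yields the rows lazily, one at a time, instead of building the whole list up front.

```python
a = [['9/6 9:00', 'title1', 'content1'],
     ['9/6 9:00', 'title2', 'content2'],
     ['9/6 8:22', 'title3', 'content3'],
     ['9/6 11:01', 'title4', 'content4']]
b = [['9/6 13:05', 'title5', 'content5'],
     ['9/6 12:13', 'title6', 'content6'],
     ['9/6 9:00', 'title1', 'content1'],
     ['9/6 14:21', 'title4', 'content4']]

# Parentheses instead of brackets: nothing is compared until a row is requested.
changed = (row for row in b if row not in a)

first = next(changed)  # pulls only the first differing row
rest = list(changed)   # consumes whatever is left

print(first)
for row in rest:
    print(row)
```

A generator can only be iterated once, so if the differences are needed more than once, the list comprehension is probably the simpler choice.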

Best algorithm for comparing lists or dictionaries