Question

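I would like to get reliable diffs of the content of this page (structural changes are rare and can therefore be ignored). More specifically, the only change I need to pick up is the addition of a new instruction ID.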

To get a feel for what difflib produces, I first diffed two identical HTML documents, hoping to get nothing back:

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.read()
import difflib
d = difflib.Differ()
diffed = d.compare(content, content)

Since difflib mimics the UNIX diff utility, I would expect diffed to contain nothing (or to give some indication that the sequences are identical). But if I '\n'.join(diffed), I get something resembling HTML, although it does not render in a browser.

Indeed, if I take the simplest possible case and diff two single characters:

diffed = d.compare('a', 'a')

iterating over diffed produces the following:

'  a'

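So am I expecting something from difflib that it cannot or will not provide (and should I change tack), or am I misusing it? What are viable alternatives for diffing HTML?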

Answer 1

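The arguments to Differ.compare() should be sequences of strings. If you pass two plain strings, each one is itself treated as a sequence, so they are compared character by character.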
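The difference is easy to see offline. The following is a minimal sketch (Python 3, standard library only; the sample strings are made up for illustration) that passes the same text first as a plain string and then as a list of lines:

```python
import difflib

d = difflib.Differ()

# Two plain strings: each string is treated as a sequence of characters,
# so every character becomes its own entry in the delta.
by_char = list(d.compare('ab', 'ab'))

# Two lists of lines: each element is compared as a whole line.
by_line = list(d.compare(['ab\n'], ['ab\n']))

print(by_char)  # ['  a', '  b']
print(by_line)  # ['  ab\n']
```

In both cases the entries start with two spaces, which is how Differ marks content common to both inputs.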

So your example should be rewritten as:

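import difflib
import urllib  # Python 2; on Python 3 use urllib.request.urlopen and decode the bytes

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.readlines()  # get the response as a list of lines

d = difflib.Differ()
diffed = d.compare(content, content)
# each line keeps its trailing newline, so join with '' rather than '\n'
print(''.join(diffed))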
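Note that Differ produces a full delta: lines common to both inputs are still emitted, prefixed with two spaces, so identical inputs do not give empty output. If only the changes matter, one option is to filter the delta by prefix; another is difflib.unified_diff, which yields nothing when the inputs are identical. A minimal sketch in Python 3, where the sample lines and the helper name changed_lines are hypothetical and not taken from the real page:

```python
import difflib

# Hypothetical sample data standing in for two snapshots of the page.
old = ['GN 00101\n', 'GN 00102\n']
new = ['GN 00101\n', 'GN 00102\n', 'GN 00103\n']


def changed_lines(a, b):
    """Return only the Differ lines that mark an addition or a removal."""
    return [line for line in difflib.Differ().compare(a, b)
            if line.startswith(('+ ', '- '))]


same = changed_lines(old, old)                       # identical inputs
added = changed_lines(old, new)                      # one line appended
unified_same = list(difflib.unified_diff(old, old))  # identical inputs

print(same)          # []
print(added)         # ['+ GN 00103\n']
print(unified_same)  # []
```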

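If you only want to compare the content of the HTML files, you should process them with a parser and keep only the text without the markup, for example using BeautifulSoup's soup.stripped_strings: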

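...
soup = bs4.BeautifulSoup(html_content, 'html.parser')  # name the parser explicitly
# stripped_strings yields each text node with surrounding whitespace removed
diff = d.compare(list(soup.stripped_strings), list_to_compare_to)
print('\n'.join(diff))
...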
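For a self-contained illustration of the same idea, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser (so it is not the exact approach above) and keeps only the strings that were added. It assumes Python 3; the class TextExtractor, the helpers text_strings and added_strings, and the sample markup are all hypothetical:

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect non-empty text nodes, stripped of surrounding whitespace."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


def text_strings(html):
    """Return the visible text nodes of an HTML string as a list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.strings


def added_strings(old_html, new_html):
    """Return the text nodes present in new_html but not in old_html."""
    diff = difflib.Differ().compare(text_strings(old_html),
                                    text_strings(new_html))
    # Keep only additions and drop the two-character '+ ' prefix.
    return [line[2:] for line in diff if line.startswith('+ ')]


# Made-up snapshots: the markup changes slightly and one ID is added.
old_page = '<ul><li>ID-001</li><li>ID-002</li></ul>'
new_page = '<ul class="x"><li>ID-001</li><li>ID-002</li><li>ID-003</li></ul>'

result = added_strings(old_page, new_page)
print(result)  # ['ID-003']
```

Because only text nodes are compared, the added class attribute is ignored. This simple extractor also collects the contents of script and style elements, which a real page might need to filter out.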

Comparing HTML with difflib
