Question

The following code does a good job of extracting content-only changes to a certain type of page:

import difflib
import urllib2

import bs4

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # whole response body as a single string

#url2 = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016055519AM'
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # whole response body as a single string

d = difflib.Differ()

# Line-by-line diff of the raw HTML (result currently unused)
diffed = d.compare(content.splitlines(), content2.splitlines())
#print('\n'.join(diffed))

# Diff only the visible text of the two pages
soup = bs4.BeautifulSoup(content, "lxml")
soup2 = bs4.BeautifulSoup(content2, "lxml")
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print change
#print('\n'.join(diff))

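The page is essentially a document, so ideally I am only interested in detecting differences in the part of the page below the menu and above the footer. I expected pages like this to see very few changes to the footer or menu, but rerunning the diff a few days later showed that subtle changes had been made: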

- Potential Entitlement
- Social Security Statement
- American Indians and Alaska Natives
+ American Indians/Alaska Natives
- Asian Americans and Pacific Islanders
+ Asian Americans/Pacific Islanders
- Self-employed
+ Self-Employed
- Awards
+ Digital Government Strategy
+ Open Government
- Podcasts
- Webinars
- Digital Government Strategy

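Given that I have gone the BeautifulSoup route of parsing the whole page (rather than parsing parts of it with lxml), am I restricted here? Before running difflib, do I need to go back and split the page into sections (or take just the section //div[@class = 'grid'])?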
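
One way the second option could look, as a minimal sketch rather than a tested solution for this site: keep the BeautifulSoup route, but narrow each parsed page to the main container before collecting its strings. It assumes the document body sits in a div whose class includes grid (taken from the XPath above). The helper names main_strings and content_changes, the parser argument, and the sample HTML are hypothetical. It is written for Python 3 and uses the stdlib "html.parser" backend so that it runs without lxml; passing "lxml" as the parser should work as well.

```python
import difflib

from bs4 import BeautifulSoup


def main_strings(html, parser="html.parser"):
    """Return the visible strings inside <div class="grid"> only.

    Falls back to the whole page if no such div exists (an assumption
    made here so the helper never returns an empty list by accident).
    """
    soup = BeautifulSoup(html, parser)
    grid = soup.find("div", class_="grid")
    root = grid if grid is not None else soup
    return list(root.stripped_strings)


def content_changes(old_html, new_html):
    """Diff the main content of two pages, keeping only added/removed lines."""
    diff = difflib.Differ().compare(main_strings(old_html), main_strings(new_html))
    # Differ also emits "  " (unchanged) and "? " (hint) lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical pages: menu and footer differ, and so does one paragraph.
    old_page = (
        '<ul class="menu"><li>Self-employed</li></ul>'
        '<div class="grid"><p>First version</p><p>Unchanged</p></div>'
        '<div class="footer">Podcasts</div>'
    )
    new_page = (
        '<ul class="menu"><li>Self-Employed</li></ul>'
        '<div class="grid"><p>Second version</p><p>Unchanged</p></div>'
        '<div class="footer">Webinars</div>'
    )
    for change in content_changes(old_page, new_page):
        print(change)
```

Note that class_="grid" matches any div whose class list contains grid, which is slightly looser than the XPath equality test. If the menu or footer were nested inside that div, they would need to be removed separately (for example with Tag.decompose()) before collecting the strings.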

Ignoring content diffs in certain parts of a page, using difflib & BeautifulSoup

0 answers: