The following code works well for extracting content-only changes to a certain type of page:
import bs4
from bs4 import BeautifulSoup
import urllib2
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read() # read the whole response body as a single string
#url2 = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016055519AM'
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read() # read the whole response body as a single string
import difflib
d = difflib.Differ()
diffed = d.compare(content.splitlines(), content2.splitlines()) # raw line-by-line diff of the two pages
#print('\n'.join(diffed))
soup = bs4.BeautifulSoup(content, "lxml")
soup2= bs4.BeautifulSoup(content2, "lxml")
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print change
#print('\n'.join(diff))
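The page is essentially a document, so ideally I am only interested in detecting differences in the part of the page above the footer and below the menu. I would expect very few changes to the footer or menu on pages like this one, but re-running the diff a few days later shows that subtle changes have been made: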
- Potential Entitlement
- Social Security Statement
- American Indians and Alaska Natives
+ American Indians/Alaska Natives
- Asian Americans and Pacific Islanders
+ Asian Americans/Pacific Islanders
- Self-employed
+ Self-Employed
- Awards
+ Digital Government Strategy
+ Open Government
- Podcasts
- Webinars
- Digital Government Strategy
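Given that I have gone down the BeautifulSoup route of parsing the entire page (rather than parsing parts of it with lxml), am I limited here? Before running difflib, do I need to go back and split the page into sections (or keep only the section matching //div[@class = 'grid'])?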
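For illustration, here is a minimal sketch of what restricting the diff to that one section might look like without leaving BeautifulSoup. It is written for Python 3 and uses the built-in "html.parser" so it runs without lxml. The helper names (main_content_strings, content_changes) and the sample HTML are hypothetical, and it assumes the document body really does sit inside a single div with class "grid", as the XPath above suggests.

```python
import difflib
from bs4 import BeautifulSoup


def main_content_strings(html, css_class="grid"):
    """Return the visible text fragments inside the first <div class=css_class>.

    Falls back to the whole document if no such div is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", class_=css_class)
    if section is None:
        section = soup
    return list(section.stripped_strings)


def content_changes(old_html, new_html, css_class="grid"):
    """Diff only the chosen section and keep the removed/added lines."""
    diff = difflib.Differ().compare(
        main_content_strings(old_html, css_class),
        main_content_strings(new_html, css_class),
    )
    # Keep "- " and "+ " lines; drop unchanged lines and "? " hint lines.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical sample pages: the menu and footer change, and so does the body.
old_page = """
<html><body>
<ul id="menu"><li>Awards</li><li>Podcasts</li></ul>
<div class="grid"><p>Section 1</p><p>Old wording</p></div>
<div id="footer">Self-employed</div>
</body></html>
"""
new_page = """
<html><body>
<ul id="menu"><li>Open Government</li></ul>
<div class="grid"><p>Section 1</p><p>New wording</p></div>
<div id="footer">Self-Employed</div>
</body></html>
"""

for change in content_changes(old_page, new_page):
    print(change)
```

With this approach the menu and footer edits never reach difflib, so only the change inside the "grid" div is reported. If the real page wraps its content differently, the css_class argument (or the find call itself) would need adjusting.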