Using difflib & BeautifulSoup to ignore content differences in certain parts of a page

Time: 2016-02-19 15:08:18

Tags: python html web-scraping beautifulsoup difflib

The following code does a good job of extracting content-only changes for a certain type of page:

import difflib
import urllib2

import bs4

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # response body as a single string

#url2 = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016055519AM'
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # response body as a single string

d = difflib.Differ()

# Raw line-by-line diff of the two HTML sources (not used below)
diffed = d.compare(content.splitlines(), content2.splitlines())
#print('\n'.join(diffed))

# Diff only the visible text of each page
soup = bs4.BeautifulSoup(content, "lxml")
soup2 = bs4.BeautifulSoup(content2, "lxml")
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
# Keep only the removed ('-') and added ('+') strings
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print change
#print('\n'.join(diff))
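To illustrate how the filtering step above behaves, here is a minimal sketch using made-up strings in place of real page text (the two lists are assumptions, not output from the page). `difflib.Differ.compare` expects two sequences of strings; if it is given two plain strings instead, it compares them character by character. The sketch is written to run under Python 3.

```python
import difflib

# Hypothetical stand-ins for list(soup.stripped_strings) on two page versions.
old_strings = ["Home", "Self-employed", "Awards", "Contact"]
new_strings = ["Home", "Self-Employed", "Contact"]

diff = list(difflib.Differ().compare(old_strings, new_strings))

# Differ prefixes every entry with a two-character code:
#   "  " unchanged, "- " only in the first list, "+ " only in the second,
#   "? " a hint line that points at intraline differences.
# Matching on "- " and "+ " keeps the real changes and drops the hint lines.
changes = [line for line in diff if line.startswith(("- ", "+ "))]

for line in changes:
    print(line)
```

Matching on the full two-character prefix rather than a bare `-` or `+` is a small design choice: it makes the intent explicit and cannot be confused with the `? ` hint lines.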

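The page is essentially a document, so I am only interested in detecting differences in the part of the page below the menu and above the footer. I had hoped that the footer and menu on pages like this would rarely change, if at all, but re-running the diff a few days later shows that subtle changes have been made: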

- Potential Entitlement
- Social Security Statement
- American Indians and Alaska Natives
+ American Indians/Alaska Natives
- Asian Americans and Pacific Islanders
+ Asian Americans/Pacific Islanders
- Self-employed
+ Self-Employed
- Awards
+ Digital Government Strategy
+ Open Government
- Podcasts
- Webinars
- Digital Government Strategy

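Given that I have gone the BeautifulSoup route of parsing the entire page (rather than parsing parts of it with lxml), am I limited here? Before running difflib, do I need to go back and split the page into sections (or select only the section //div[@class = 'grid'])?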
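One possible way to narrow the comparison without leaving BeautifulSoup is sketched below. It assumes the document body lives in a `div` whose class includes `grid`, as the XPath above suggests. The helper names `section_strings` and `section_changes` and the miniature pages are hypothetical, and the built-in `html.parser` is used only so the sketch does not depend on lxml.

```python
import difflib

from bs4 import BeautifulSoup


def section_strings(html, css_class="grid"):
    """Return the visible strings inside the first <div> with css_class.

    Falls back to the whole document if no such div is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", class_=css_class)
    if section is None:
        section = soup
    return list(section.stripped_strings)


def section_changes(old_html, new_html, css_class="grid"):
    """Diff only the selected section and keep removed/added strings."""
    diff = difflib.Differ().compare(
        section_strings(old_html, css_class),
        section_strings(new_html, css_class),
    )
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical miniature pages: menu, footer, and document body all change.
old_page = """
<ul class="menu"><li>Awards</li></ul>
<div class="grid"><p>Section 1</p><p>Old wording</p></div>
<div class="footer">Podcasts</div>
"""
new_page = """
<ul class="menu"><li>Open Government</li></ul>
<div class="grid"><p>Section 1</p><p>New wording</p></div>
<div class="footer">Webinars</div>
"""

for change in section_changes(old_page, new_page):
    print(change)
```

Note that `class_="grid"` matches any `div` whose class list contains `grid`, which is slightly looser than the exact-match XPath. If the real page has no single container for the document body, an alternative under the same assumptions would be to call `decompose()` on the menu and footer elements before reading `stripped_strings`.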
