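I downloaded this page and made a very small edit to it, changing the first 65 in one paragraph to 68.
I then ran both versions through the following code to pull out the differences.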
import difflib  # needed for difflib.Differ() below
import bs4
from bs4 import BeautifulSoup
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # read the whole response body as a single string
root = lh.fromstring(content)
section1 = root.xpath("//div[@class = 'column-12']")[0]
section1_text = section1.text_content()
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # read the whole response body as a single string
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[@class = 'column-12']")[0]
section2_text = section2.text_content()
d = difflib.Differ()
soup = bs4.BeautifulSoup(unicode(section1_text))
soup2 = bs4.BeautifulSoup(unicode(section2_text))
from nltk import sent_tokenize
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]
diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)
Printing the changes gives:
- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
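So a change is marked with + whether it is a genuinely new addition (brand-new sentences are also marked with +) or a minor modification to an existing sentence. Unless my program does some extra processing, it will conclude that one new sentence was added and another was removed.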
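The behavior described above can be reproduced without downloading anything. The following is a minimal sketch that uses two short, made-up sentence lists (`old` and `new` are hypothetical stand-ins for the tokenized pages). It also shows that `difflib.Differ` emits `?` hint lines next to a pair of lines it considers similar, and that the `-`/`+` filter in the code above discards them.

```python
import difflib

# Hypothetical stand-ins for the sentences extracted from the two pages.
old = ["Offset ends at age 65.", "This sentence is unchanged."]
new = ["Offset ends at age 68.", "This sentence is unchanged.",
       "A brand-new sentence."]

diff = list(difflib.Differ().compare(old, new))

# Same filter as in the question: keep only '-' and '+' lines.
changes = [line for line in diff
           if line.startswith('-') or line.startswith('+')]
for line in changes:
    print(line)

# Differ also produced '?' lines marking the intraline difference
# between the two similar sentences; the filter above drops them.
hints = [line for line in diff if line.startswith('?')]
print(len(hints))
```

The modified sentence and the brand-new sentence both show up with a `+`, which is exactly the ambiguity in question.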
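How can we exploit the fact that the sentence difflib sees as apparently deleted and the sentence it sees as apparently added are very similar, so that we can determine we are actually dealing with an in-place change to an existing sentence?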
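Note: the solution needs to be able to handle possibly several changes within one page, so it is not enough to apply something like if sentence1 very similar to sentence 2: then it's a modification, because there will be several differences to compare and contrast.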
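To make the requirement concrete, here is one possible sketch of the kind of post-processing being asked about, not a definitive solution. It groups changes with `difflib.SequenceMatcher.get_opcodes()` and, inside each changed block, pairs every removed sentence with its most similar added sentence. The helper names `similarity` and `classify_changes`, the greedy pairing, the `0.75` threshold, and the sample sentences are all assumptions made for illustration.

```python
import difflib

def similarity(a, b):
    # Character-level similarity ratio in [0, 1].
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

def classify_changes(old, new, threshold=0.75):
    """Split a sentence-level diff into modified pairs, removals, additions."""
    modified, removed, added = [], [], []
    opcodes = difflib.SequenceMatcher(None, old, new).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            continue
        candidates = list(new[j1:j2])   # empty for a pure 'delete' block
        for sentence in old[i1:i2]:     # empty for a pure 'insert' block
            best = None
            if candidates:
                best = max(candidates, key=lambda c: similarity(sentence, c))
            if best is not None and similarity(sentence, best) >= threshold:
                modified.append((sentence, best))
                candidates.remove(best)  # each new sentence is paired once
            else:
                removed.append(sentence)
        added.extend(candidates)         # whatever was not paired is new
    return modified, removed, added

# Hypothetical data with several changes on one "page".
old = ["Offset ends at age 65.", "This sentence is unchanged.",
       "An obsolete remark that was dropped.", "Benefits are paid monthly."]
new = ["Offset ends at age 68.", "This sentence is unchanged.",
       "Benefits are paid weekly.", "A brand-new sentence."]

modified, removed, added = classify_changes(old, new)
for before, after in modified:
    print("~ %s -> %s" % (before, after))
for sentence in removed:
    print("- %s" % sentence)
for sentence in added:
    print("+ %s" % sentence)
```

Because the pairing is done per changed block rather than across the whole page, several independent edits do not get compared against each other. The greedy choice and the threshold would likely need tuning on real pages.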