Question

I downloaded this page and made a very small edit to it, changing the first 65 in this paragraph to 68:

I then ran it through the following code to pull out the differences.

import difflib
import bs4
from bs4 import BeautifulSoup
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # read the whole response body as one string
root = lh.fromstring(content)
section1 = root.xpath("//div[@class = 'column-12']")[0]
section1_text = section1.text_content()

url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # read the whole response body as one string
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[@class = 'column-12']")[0]
section2_text = section2.text_content()

d = difflib.Differ()

soup = bs4.BeautifulSoup(unicode(section1_text))
soup2= bs4.BeautifulSoup(unicode(section2_text))

from nltk import sent_tokenize

sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)

Printing the changes gives:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).

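So a change is flagged with a +, regardless of whether it is a brand-new addition (entirely new sentences are also flagged with a +) or a minor change to an existing sentence. Unless my program does some extra processing, it will conclude that one sentence was added and another was deleted.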
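To make this behaviour concrete, here is a small self-contained sketch (Python 3, standard library only; the sentences are made-up stand-ins for the tokenized page text). Besides the - and + lines, Differ also emits lines starting with '? ' next to lines it judges to be near matches, and the startswith('-') / startswith('+') filter used above throws those hints away.

```python
import difflib

# Made-up sentences standing in for the tokenized page text.
old = ["The offset ends at age 65.", "This sentence is unchanged."]
new = ["The offset ends at age 68.", "This sentence is unchanged.",
       "This sentence is completely new."]

diff = list(difflib.Differ().compare(old, new))

# The first two characters of each diff line are its code:
# '- ' removed, '+ ' added, '  ' common, '? ' intraline hint.
codes = [line[:2] for line in diff]

for line in diff:
    print(line.rstrip("\n"))
```

In this toy input the edited sentence and the brand-new sentence both show up with a + code, which is the ambiguity described above; only the edited one is accompanied by '? ' hint lines.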

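How can we make use of the fact that the sentence difflib sees as 'deleted' and the sentence it sees as 'added' are very similar, in order to determine that we are actually dealing with an in-place change to an existing sentence?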

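Note: the solution needs to be able to handle possibly multiple changes within one page, so it is not sufficient to apply something like if sentence1 very similar to sentence 2: then it's a modification, since there will be several differences to compare and contrast.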
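One possible direction, sketched here under stated assumptions rather than as a definitive solution: use difflib.SequenceMatcher on the two sentence lists to get change blocks, then, inside each block, pair every removed sentence with its most similar added sentence. The function name classify_changes, the greedy pairing, and the 0.8 similarity threshold are all hypothetical choices for illustration; the code is Python 3 and works on plain lists of sentences such as those produced by sent_tokenize.

```python
import difflib


def classify_changes(old_sentences, new_sentences, threshold=0.8):
    """Split a sentence-level diff into modified, added and removed sentences.

    `threshold` is an assumed similarity cutoff in [0, 1], not a tuned value.
    """
    modified, added, removed = [], [], []
    matcher = difflib.SequenceMatcher(None, old_sentences, new_sentences,
                                      autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        old_block = old_sentences[i1:i2]        # empty for a pure insert
        new_block = list(new_sentences[j1:j2])  # empty for a pure delete
        for old in old_block:
            # Greedily pick the most similar, not yet paired, new sentence.
            best, best_ratio = None, threshold
            for candidate in new_block:
                ratio = difflib.SequenceMatcher(None, old, candidate).ratio()
                if ratio >= best_ratio:
                    best, best_ratio = candidate, ratio
            if best is None:
                removed.append(old)
            else:
                modified.append((old, best))
                new_block.remove(best)
        # Whatever was not paired in this block counts as newly added.
        added.extend(new_block)
    return modified, added, removed


# Made-up example: one edited sentence and one brand-new sentence.
old_sentences = [
    "Intro sentence.",
    "Offset ends for disability beneficiaries from age 65 to full retirement age (FRA).",
    "Closing sentence.",
]
new_sentences = [
    "Intro sentence.",
    "Offset ends for disability beneficiaries from age 68 to full retirement age (FRA).",
    "A brand new sentence about something else entirely.",
    "Closing sentence.",
]
modified, added, removed = classify_changes(old_sentences, new_sentences)
```

Because pairing happens per change block, several independent edits on one page are handled separately. The greedy matching is a simplification: with many similar sentences in one block it may not find the best overall assignment.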

How can I use difflib and nltk to distinguish added sentences from changed sentences?

0 Answers: