How can I use difflib and nltk to distinguish added sentences from changed sentences?

Asked: 2016-02-19 19:30:43

Tags: python nlp nltk tokenize difflib

I downloaded this page and made a very small edit to it, changing the first 65 in one of its paragraphs to 68.


I then ran both versions through the following code to pull out the differences.

# Python 2 code (urllib2, unicode)
import difflib
import urllib2
import bs4
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # read the whole response body as one string
root = lh.fromstring(content)
section1 = root.xpath("//div[@class = 'column-12']")[0]
section1_text = section1.text_content()

url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # read the whole response body as one string
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[@class = 'column-12']")[0]
section2_text = section2.text_content()

d = difflib.Differ()

soup = bs4.BeautifulSoup(unicode(section1_text))
soup2 = bs4.BeautifulSoup(unicode(section2_text))

from nltk import sent_tokenize

sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)

Printing the changes gives:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).

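So a change is marked with + whether it is a brand-new addition (an entirely new sentence is also marked with +) or a minor change to an existing sentence. Unless my program does some extra processing, it will conclude that one new sentence was added and another sentence was removed.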
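To make the problem concrete, here is a minimal sketch with three made-up sentences (hypothetical stand-ins, not the real page content). It applies the same '-'/'+' filter as the code above to a small edit and to a genuinely new sentence. It also collects the '? ' lines, which the difflib documentation describes as lines not present in either input sequence; Differ emits them next to pairs it considers similar, and the filter throws them away.

```python
import difflib

# Hypothetical stand-in sentences, not the real page content.
old = [
    "The program started in 1990.",
    "The offset ends at age 65.",
]
new = [
    "The program started in 1990.",
    "The offset ends at age 68.",       # small in-place edit
    "A new reporting rule was added.",  # brand-new sentence
]

lines = list(difflib.Differ().compare(old, new))

# Same filter as in the question: the edit and the new sentence both show up as '+'.
plus_lines = [line for line in lines if line.startswith('+')]
minus_lines = [line for line in lines if line.startswith('-')]

# Hint lines that Differ places next to similar pairs; the filter above drops them.
hint_lines = [line for line in lines if line.startswith('?')]

for line in lines:
    print(line.rstrip('\n'))
```

Looking only at `plus_lines`, nothing separates the edited sentence from the new one; any distinction has to come from comparing the '-' and '+' sentences with each other.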

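How can we leverage the fact that the sentence difflib sees as apparently "removed" and the sentence it sees as apparently "added" are very similar, in order to determine that we are actually dealing with an in-place change to an existing sentence?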
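One way to put a number on "very similar" is `difflib.SequenceMatcher.ratio()`, which returns a float between 0 and 1. The sketch below applies it to the two sentences from the output above; the variable names are illustrative, and where to draw the line between "edited" and "unrelated" is left open.

```python
import difflib

removed = ("It contains a Title II provision that changes the age at which "
           "workers compensation/public disability offset ends for disability "
           "beneficiaries from age 65 to full retirement age (FRA).")
added = ("It contains a Title II provision that changes the age at which "
         "workers compensation/public disability offset ends for disability "
         "beneficiaries from age 68 to full retirement age (FRA).")

# autojunk=False turns off the popularity heuristic that can kick in on long sequences.
ratio = difflib.SequenceMatcher(None, removed, added, autojunk=False).ratio()
print(round(ratio, 3))
```

The two sentences differ in a single character, so the ratio comes out very close to 1, whereas two unrelated sentences would score much lower.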

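Note: the solution needs to handle potentially multiple changes in one page, so it is not sufficient to apply something like if sentence1 very similar to sentence 2: then it's a modification, because there will be several differences to compare and contrast.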
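One possible way to cope with several changes on a page, offered as a sketch rather than a definitive answer: run `SequenceMatcher.get_opcodes()` over the two sentence lists so that each changed region is handled on its own, then pair sentences inside each region by similarity. The function name `classify_changes`, the 0.75 threshold, and the sample sentences are all assumptions made for illustration.

```python
import difflib


def classify_changes(old, new, threshold=0.75):
    """Return (kind, old_sentence, new_sentence) tuples for every difference.

    kind is 'modified', 'removed' or 'added'. The threshold is an assumed
    cut-off that would need tuning on real data.
    """
    results = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        # Sentences added in this changed region that are not yet paired up.
        unpaired = list(new[j1:j2])
        for sentence in old[i1:i2]:
            best, best_ratio = None, threshold
            for candidate in unpaired:
                ratio = difflib.SequenceMatcher(
                    None, sentence, candidate, autojunk=False).ratio()
                if ratio >= best_ratio:
                    best, best_ratio = candidate, ratio
            if best is None:
                results.append(('removed', sentence, None))
            else:
                unpaired.remove(best)
                results.append(('modified', sentence, best))
        # Whatever is left had no similar counterpart: treat it as new.
        for sentence in unpaired:
            results.append(('added', None, sentence))
    return results


# Hypothetical sample data with two edits and one genuinely new sentence.
old_sentences = [
    "Alpha stays.",
    "The offset ends at age 65.",
    "Beta stays.",
    "Payments begin in March of each year.",
]
new_sentences = [
    "Alpha stays.",
    "The offset ends at age 68.",
    "Beta stays.",
    "Payments begin in April of each year.",
    "Zebras quack loudly!",
]

for kind, before, after in classify_changes(old_sentences, new_sentences):
    print(kind, before, after)
```

Because pairing happens only inside a single opcode region, an edit near the top of the page is never compared against a new sentence near the bottom. The pairing itself is greedy, so a region with many similar sentences could still be paired imperfectly.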

0 answers:

No answers yet