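Finding the unique sentences in two files

Asked: 2016-12-01 04:49:09

Tags: python python-2.7 python-3.x pattern-matching difflib

I have two files, and I am trying to print the sentences that are unique to each of them. To do this, I am using difflib in Python.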

import difflib

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

differ = difflib.Differ()
diff = differ.compare(text, text1)
print('\n'.join(diff))

It does not give me the output I want. Instead, it gives me this:

  P
  h
  y
  s
  i
  c
  s

  i
  s

  o
  n
  e

  o
  f

  t
  h
  e

The output I want is just the sentences that are unique to each of the two files:

  

text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.

text1 = Quantum chemistry is a branch of chemistry.

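It seems that difflib.Differ compares line by line rather than sentence by sentence. How can I do this? Any suggestions would be appreciated.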

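1 Answer:

Answer 0: (score: 1)

As DZinoviev noted, you are passing strings to a function that expects lists. You do not need NLTK; you can turn each string into a list of sentences by splitting on periods.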

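import difflib

text1 = """Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry."""
text2 = """Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry."""

# Differ.compare expects two sequences of strings, so split each text into
# sentences on periods, strip surrounding whitespace, and drop empty pieces.
list1 = [s.strip() for s in text1.split(".") if s.strip()]
list2 = [s.strip() for s in text2.split(".") if s.strip()]

differ = difflib.Differ()
diff = differ.compare(list1, list2)
# Parenthesized print works on both Python 2.7 and Python 3.
print("\n".join(diff))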
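
The diff above still includes the sentences the two texts share. To get only the unique sentences the question asks for, one option is to keep just the entries that Differ marks with "- " (present only in the first sequence) or "+ " (present only in the second). The following is a minimal sketch, not a definitive implementation: the helper name unique_sentences is hypothetical, and splitting on periods is a naive assumption that would break on abbreviations such as "e.g.".

```python
import difflib


def unique_sentences(a, b):
    """Return (only_in_a, only_in_b) for two period-separated texts.

    Hypothetical helper for illustration; it is not part of difflib.
    """
    # Naive sentence split: an assumption that every period ends a sentence.
    sents_a = [s.strip() for s in a.split(".") if s.strip()]
    sents_b = [s.strip() for s in b.split(".") if s.strip()]

    only_a, only_b = [], []
    for line in difflib.Differ().compare(sents_a, sents_b):
        # Differ prefixes each entry with a two-character code:
        # "- " only in a, "+ " only in b, "  " common, "? " hint lines.
        if line.startswith("- "):
            only_a.append(line[2:])
        elif line.startswith("+ "):
            only_b.append(line[2:])
    return only_a, only_b


text1 = ("Physics is one of the oldest academic disciplines. "
         "Perhaps the oldest through its inclusion of astronomy. "
         "Over the last two millennia. "
         "Physics was a part of natural philosophy along with chemistry.")
text2 = ("Physics is one of the oldest academic disciplines. "
         "Physics was a part of natural philosophy along with chemistry. "
         "Quantum chemistry is a branch of chemistry.")

only_a, only_b = unique_sentences(text1, text2)
print("text1 = " + ". ".join(only_a) + ".")
print("text2 = " + ". ".join(only_b) + ".")
```

Note that Differ is order-sensitive: a sentence that appears in both texts but at a different position may be reported as removed from one and added to the other. If order does not matter, comparing sets of sentences may be a better fit.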