Python: comparing two strings, like comparing two documents in Word

Time: 2017-04-10 03:53:08

Tags: python text-analysis

I want to compare two paragraphs at the character level to see which words were modified.

The paragraphs to compare:

t1 = '''1 Then was Jesus led up of the Spirit into the wilderness to be tempted of the devil.
2 And when he had fasted forty days and forty nights, he was afterward an hungred.
3 And when the tempter came to him, he said, If thou be the Son of God, command that these stones be made bread.
'''.splitlines(keepends=True)

t2 = '''1 Then Jesus was led up of the Spirit, into the wilderness, to be with God.
2 And when he had fasted forty days and forty nights, and had communed with God, he was afterwards an hungered, and was left to be tempted of the devil,
3 And when the tempter came to him, he said, If thou be the Son of God, command that these stones be made bread.
'''.splitlines(keepends=True)

When I try difflib, it works well on the first line, but it does not detect the character-level differences in the second line.

>>> from difflib import Differ, SequenceMatcher

>>> d = Differ()
>>> result = list(d.compare(t1,t2))
>>> for i in result:
...     print(i, end='')

Result:

- 1 Then was Jesus led up of the Spirit into the wilderness to be tempted of the devil.
?        ----                                                     ^^^^^^^^^^^  -  ----
+ 1 Then Jesus was led up of the Spirit, into the wilderness, to be with God.
?            ++++                      +                    +       ^^   ++
- 2 And when he had fasted forty days and forty nights, he was afterward an hungred.
+ 2 And when he had fasted forty days and forty nights, and had communed with God, he was afterwards an hungered, and was left to be tempted of the devil,
  3 And when the tempter came to him, he said, If thou be the Son of God, command that these stones be made bread.

Only the first line produces the output I want.

The same happens even if I extract the second line and compare it on its own:

t1 = '''2 And when he had fasted forty days and forty nights, he was afterward an hungred.
'''.splitlines(keepends=True)

t2 = '''2 And when he had fasted forty days and forty nights, and had communed with God, he was afterwards an hungered, and was left to be tempted of the devil,
'''.splitlines(keepends=True)
d = Differ()
result = list(d.compare(t1,t2))
for i in result:
    print(i, end='')

Result:

It does not show which characters were modified; it only indicates that the line as a whole was changed.

- 2 And when he had fasted forty days and forty nights, he was afterward an hungred.
+ 2 And when he had fasted forty days and forty nights, and had communed with God, he was afterwards an hungered, and was left to be tempted of the devil,
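A possible explanation, based on how CPython implements `difflib.Differ`: it seems to emit the `?` hint lines only when the two lines are at least roughly 75% similar according to `SequenceMatcher.ratio()`, and otherwise reports a plain delete plus insert. The sketch below checks that ratio for line 2. The names `old` and `new` are illustrative, and the 0.75 threshold is an implementation detail rather than documented behaviour, so treat it as an assumption.

```python
from difflib import SequenceMatcher

# Line 2 of each paragraph, with the trailing newline that splitlines(keepends=True) keeps.
old = "2 And when he had fasted forty days and forty nights, he was afterward an hungred.\n"
new = ("2 And when he had fasted forty days and forty nights, and had communed with God, "
       "he was afterwards an hungered, and was left to be tempted of the devil,\n")

# ratio() is 2*M/T, where M is the number of matched characters
# and T is the combined length of both strings.
ratio = SequenceMatcher(None, old, new).ratio()

# Assumption: Differ only adds '?' hint lines at roughly ratio >= 0.75.
CUTOFF = 0.75
print(f"similarity ratio: {ratio:.3f}")
print("intraline hints expected:", ratio >= CUTOFF)
```

Because the second line nearly doubles in length, at most 83 of the 236 combined characters can match, so the ratio cannot reach 0.75 no matter how the characters are aligned.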

But if I compare the second line with SequenceMatcher instead, it does seem to identify the modified characters.

p2_1 = '''2 And when he had fasted forty days and forty nights, he was afterward an hungred.'''
p2_2 = '''2 And when he had fasted forty days and forty nights, and had communed with God, he was afterwards an hungered, and was left to be tempted of the devil,'''
se = SequenceMatcher(None,p2_1, p2_2)
se.get_opcodes()

Result:

[('equal', 0, 54, 0, 54),
 ('insert', 54, 54, 54, 81),
 ('equal', 54, 70, 81, 97),
 ('insert', 70, 70, 97, 98),
 ('equal', 70, 78, 98, 106),
 ('insert', 78, 78, 106, 107),
 ('equal', 78, 81, 107, 110),
 ('replace', 81, 82, 110, 152)]
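The opcodes are index ranges into the two strings, so they can be turned into readable fragments by slicing. This is a minimal sketch; `describe_changes` is a hypothetical helper name, not part of difflib, and `autojunk=False` is passed only so that longer inputs are not affected by the junk heuristic.

```python
from difflib import SequenceMatcher


def describe_changes(old, new):
    """Return (tag, old_fragment, new_fragment) for every non-equal opcode."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


p2_1 = "2 And when he had fasted forty days and forty nights, he was afterward an hungred."
p2_2 = ("2 And when he had fasted forty days and forty nights, and had communed with God, "
        "he was afterwards an hungered, and was left to be tempted of the devil,")

# Print each change as: tag, removed text, added text.
for tag, removed, added in describe_changes(p2_1, p2_2):
    print(f"{tag:8} {removed!r} -> {added!r}")
```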

Question:

How can I compare these two paragraphs so that I can tell which characters were modified? Or is there an existing package I can use?

This is the output I want:

- 1 Then was Jesus led up of the Spirit into the wilderness to be tempted of the devil.
?        ----                                                     ^^^^^^^^^^^  -  ----
+ 1 Then Jesus was led up of the Spirit, into the wilderness, to be with God.
?            ++++                      +                    +       ^^   ++

Or something similar.
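One way to get this kind of output for every line, whatever its similarity, might be to build the `?` guide lines directly from the `SequenceMatcher` opcodes. The following is a rough sketch, not how `Differ` itself works: `char_diff` is a hypothetical helper, it assumes single-line strings without a trailing newline, and it ignores tab alignment.

```python
from difflib import SequenceMatcher


def char_diff(old, new):
    """Return Differ-style lines ('- ', '? ', '+ ', '? ') for two single-line strings."""
    old_marks, new_marks = [], []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            old_marks.append(" " * (i2 - i1))
            new_marks.append(" " * (j2 - j1))
        elif tag == "replace":
            old_marks.append("^" * (i2 - i1))
            new_marks.append("^" * (j2 - j1))
        elif tag == "delete":
            old_marks.append("-" * (i2 - i1))
        else:  # "insert"
            new_marks.append("+" * (j2 - j1))

    lines = ["- " + old]
    old_guide = "".join(old_marks).rstrip()
    if old_guide:  # skip the guide line when nothing was removed or replaced
        lines.append("? " + old_guide)
    lines.append("+ " + new)
    new_guide = "".join(new_marks).rstrip()
    if new_guide:  # skip the guide line when nothing was added or replaced
        lines.append("? " + new_guide)
    return lines


p2_1 = "2 And when he had fasted forty days and forty nights, he was afterward an hungred."
p2_2 = ("2 And when he had fasted forty days and forty nights, and had communed with God, "
        "he was afterwards an hungered, and was left to be tempted of the devil,")
print("\n".join(char_diff(p2_1, p2_2)))
```

To cover whole paragraphs, the lines would first need to be paired up, for example by index when both texts have the same number of lines.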

0 answers:

No answers