Different results from difflib methods in Python

Asked: 2018-12-19 19:13:20

Tags: python difflib

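I am using Python's difflib package to detect changes between revisions of Wikipedia articles. While debugging my script, I found that some errors occur only when using difflib.Differ().compare(), whereas difflib.HtmlDiff().make_file() apparently detects the changes correctly. Unfortunately, I could not reduce the problem to an example simpler than the actual text: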

#!/usr/bin/python3
import requests
import difflib

# get Wikipedia revisions via the API
PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": "Motín del 2 de agosto de 1810",
    "rvprop": "timestamp|user|comment|content",
    "rvslots": "main",
    "formatversion": "2",
    "rvdir": "older",  # was "revdir"; the revisions module names this parameter "rvdir"
    "rvlimit": "2",
    "rvstart": "2017-04-15T09:41:13Z",
    "format": "json",
}

S = requests.Session()
data = S.get(url="https://es.wikipedia.org/w/api.php", params=PARAMS).json()
textA = data['query']['pages'][0]['revisions'][1]['slots']['main']['content']
textB = data['query']['pages'][0]['revisions'][0]['slots']['main']['content']

#compare texts
d = difflib.Differ()
result = list(d.compare(textA, textB))
# find added characters in result
added = [idx for idx,val in enumerate(result) if val.startswith("+")]
print(len(added))

# print added characters 
# print(''.join([val[2:] for idx,val in enumerate(result) if val.startswith("+")])) 

# write both revisions to files so they can be compared from the command line
with open('textA', "w") as fileA:
    fileA.write(textA)

with open('textB', "w") as fileB:
    fileB.write(textB)

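print(len(added)) returns 26061, which suggests that far too many characters were added to the text (this is not correct). When I compare the texts with the diff.py command-line tool, run as python diff.py -m textA textB > result.html, I do get the correct result (only minor changes), as shown in this screenshot: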

[Screenshot: the HTML diff in result.html, showing only minor changes]
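One possible explanation for the inflated count, offered as an assumption and not verified against these exact revisions: Differ.compare() expects sequences of lines, so passing two plain strings makes every character count as its own "line". On top of that, difflib's SequenceMatcher applies an "autojunk" heuristic by default, which ignores any element that makes up more than about 1% of a sequence of 200 or more items. In ordinary text nearly every character is that frequent. The sketch below shows the effect with SequenceMatcher directly; the strings old and new and the helper count_added_chars are made up for illustration and are not the Wikipedia text.

```python
import difflib


def count_added_chars(a, b, autojunk=True):
    """Count characters of b that fall in 'insert' or 'replace' opcodes.

    Hypothetical helper for illustration; a and b are plain strings,
    so the comparison runs character by character.
    """
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    return sum(
        j2 - j1
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    )


# Made-up data: 500 characters from a small alphabet, so every character
# is "popular" under the autojunk heuristic.
old = "abcdefghij" * 50
# The only real change: three characters inserted in the middle.
new = old[:250] + "XYZ" + old[250:]

print("autojunk on: ", count_added_chars(old, new))
print("autojunk off:", count_added_chars(old, new, autojunk=False))
```

In this example, the default setting reports far more added characters than the three that were actually inserted, while autojunk=False reports exactly those three. Turning the heuristic off can be noticeably slower on long inputs, since it exists to speed up matching.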

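From studying the code, I understand that the command-line utility compares at line level (first?), yet the changes it detects are at character level anyway. How can I reproduce this to get a proper character-level delta like the one difflib.Differ().compare() returns?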
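A minimal sketch of the "lines first, then characters" idea described above, assuming that the goal is just to collect the added text. It does not reproduce the internals of HtmlDiff or diff.py; it only uses SequenceMatcher twice, once on lists of lines and once on the characters of each changed block. The function name added_text and the sample strings are hypothetical.

```python
import difflib


def added_text(old, new):
    """Return the pieces of text that new adds relative to old.

    Stage 1 matches whole lines; stage 2 looks inside replaced blocks
    character by character. Hypothetical helper for illustration.
    """
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    added = []

    line_matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in line_matcher.get_opcodes():
        if tag == "insert":
            # Entirely new lines: everything in them is added text.
            added.append("".join(new_lines[j1:j2]))
        elif tag == "replace":
            # Changed lines: compare only this block at character level.
            old_block = "".join(old_lines[i1:i2])
            new_block = "".join(new_lines[j1:j2])
            char_matcher = difflib.SequenceMatcher(
                None, old_block, new_block, autojunk=False
            )
            for ctag, _, _, c1, c2 in char_matcher.get_opcodes():
                if ctag in ("insert", "replace"):
                    added.append(new_block[c1:c2])
    return added


# Made-up sample: one word added inside a line, one line appended.
old = "alpha\nthe quick fox\nomega\n"
new = "alpha\nthe quick brown fox\nomega\nzeta\n"

pieces = added_text(old, new)
print(pieces)
print(sum(len(p) for p in pieces), "characters added")
```

Because the character-level pass only ever sees the lines that changed, a small edit cannot spill over into the rest of the document. For wikitext, where a whole paragraph is often a single line, the changed blocks can still be long, so the character pass may be slow with autojunk=False.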

0 answers:

No answers yet.