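Given two strings, I would like to determine, in Python, which words have been added and which words have been removed. I have looked at difflib, but apparently it cannot do this.
For example, given 'hello my name' and 'hello my guys', it would return ['guys'] as the added words and ['name'] as the removed words. Thanks a lot.
Edit: the example I gave may not be the best one. It should also work when there is no space between the current text and the new text. Perhaps I could use difflib to get all the new parts and then split them with the regexp "\b". I'll give it a try.
Answer 0 (score: 1)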
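The first thing to remember is that Python comes with "batteries included". That means you should look in the standard library for a tool that does what you need before writing your own.
A more robust technique is to reuse difflib.SequenceMatcher to find the differences between the strings. For example: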
import difflib

before = 'hello my name is'
after = 'hello my guys is'

def isjunk(string):
    "Return True if we don't care about this string"
    return string == ' '

# Compare the two strings character by character, treating spaces as junk.
s = difflib.SequenceMatcher(isjunk)
s.set_seqs(before, after)
for (
    opcode,
    before_start, before_end,
    after_start, after_end
) in s.get_opcodes():
    if opcode == 'equal':
        # We don't care.
        continue
    print("%7s '%s' -> '%s'" % (
        opcode,
        before[before_start:before_end],
        after[after_start:after_end],
    ))
This produces the following output, which can obviously be customized to do exactly what you need:
replace 'name' -> 'guys'
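Since the question asks for lists of added and removed words rather than character ranges, one possible variation is to give SequenceMatcher lists of words instead of raw strings. The following is only a sketch under assumptions, not part of the original answer: the helper name word_diff and the choice of `\w+` as the definition of a word are hypothetical.

```python
import difflib
import re


def word_diff(before, after):
    """Return (added, removed) word lists; hypothetical helper for illustration."""
    # Assumption: a "word" is a run of \w characters, so whitespace and
    # punctuation are ignored.
    before_words = re.findall(r"\w+", before)
    after_words = re.findall(r"\w+", after)

    # SequenceMatcher works on any sequence of hashable items, not only strings.
    matcher = difflib.SequenceMatcher(None, before_words, after_words)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after_words[j1:j2])
    return added, removed


added, removed = word_diff("hello my name is", "hello my guys is")
print(added)    # ['guys']
print(removed)  # ['name']
```

Because the comparison is positional, a word that merely moves to another place in the text may be reported as both removed and added.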
Answer 1 (score: 0)
before = "hello my name is"
after = "hello my guy is test"
before = before.split(' ')
after = after.split(' ')
# Print every word of the new text that does not occur in the old text.
for item in after:
    if item not in before:
        print(item)
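Answer 2 (score: 0)
This isn't particularly pretty, but it seems to work for most of the cases I can think of. I'm sure it could be tidied up a lot, and it would be easy to make it case-insensitive.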
def freqs(word_list):
    "Count how many times each word occurs"
    words = {}
    for word in word_list:
        words[word] = words.get(word, 0) + 1
    return words

def added_and_removed(a, b):
    af = freqs(a.split())
    bf = freqs(b.split())
    removed = []
    added = []
    # Words that occur only in a were removed, once per occurrence.
    for key in af:
        if key not in bf:
            removed.extend([key] * af[key])
    for key in bf:
        num = af.get(key)
        if num is None:
            # The word occurs only in b, so it was added.
            added.append(key)
        elif num > bf[key]:
            # The word occurs in both, but fewer times in b:
            # the surplus occurrences were removed.
            removed.extend([key] * (num - bf[key]))
    return added, removed

a = 'hello hello hello my name is Dave dave bar foo'
b = 'hello my guys is test easy rob dave beef foo'
added, removed = added_and_removed(a, b)
print(added)
print(removed)
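This gives (the order of the words follows dictionary iteration order, so it may differ between Python versions):
['beef', 'rob', 'easy', 'test', 'guys']
['bar', 'name', 'Dave', 'hello', 'hello']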
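The same frequency bookkeeping can also be sketched with collections.Counter from the standard library, whose subtraction keeps only positive counts. This is an alternative offered under assumptions, not the answer author's code: the function name added_and_removed_counter is hypothetical, and it reports a word once for every extra occurrence in either direction.

```python
from collections import Counter


def added_and_removed_counter(a, b):
    """Hypothetical helper: word-level multiset difference of two strings."""
    before = Counter(a.split())
    after = Counter(b.split())
    # Counter subtraction drops zero and negative counts, so each result
    # holds only the surplus on one side.
    added = list((after - before).elements())
    removed = list((before - after).elements())
    return added, removed


a = 'hello hello hello my name is Dave dave bar foo'
b = 'hello my guys is test easy rob dave beef foo'
added, removed = added_and_removed_counter(a, b)
print(sorted(added))    # ['beef', 'easy', 'guys', 'rob', 'test']
print(sorted(removed))  # ['Dave', 'bar', 'hello', 'hello', 'name']
```

The results are sorted here only to make the printed order predictable; the comparison is still case-sensitive, so 'Dave' and 'dave' count as different words.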