Finding added and removed words between two strings

Time: 2012-04-09 14:22:09

Tags: python diff

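Given two strings, I would like to determine, in Python, which words have been added and which have been removed. I have looked at difflib, but apparently it can't do this.

For example, given 'hello my name' and 'hello my guys', it would return ['guys'] as the added words and ['name'] as the removed words. Many thanks.

Edit: the example I gave may not be the best one. It should also work when there is no whitespace between the existing text and the new text. Perhaps I could use difflib to get all the new parts and then split them with the regexp "\b". I'll give that a try.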

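3 answers:

Answer 0 (score: 1)

The first thing to remember about Python is that it comes with "batteries included". This means you should look in the standard library for a tool that does what you need before building your own.

A more robust technique is to reuse difflib.SequenceMatcher to find the differences between the strings. For example: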

import difflib

before = 'hello my name is'
after = 'hello my guys is'

def isjunk(string):
    "Return True if we don't care about this string"
    return string == ' '


s = difflib.SequenceMatcher(isjunk)
s.set_seqs(before, after)

for (
        opcode,
        before_start, before_end,
        after_start, after_end
) in s.get_opcodes():
    if opcode == 'equal':
        # We don't care.
        continue

    # print() as a function, so the example runs on Python 3.
    print("%7s '%s' -> '%s'" % (
        opcode,
        before[before_start:before_end],
        after[after_start:after_end],
    ))

This produces the following output, which can obviously be customized to do exactly what you need:

replace 'name' -> 'guys'
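
The example above compares the strings character by character. To get whole words, as the question asks, the same SequenceMatcher can be given lists of tokens instead. What follows is a minimal sketch and not part of the original answer: the helper name word_diff and the `\w+` tokenizer are assumptions chosen for illustration. Tokenizing on runs of word characters also covers text where words are separated by punctuation rather than spaces.

```python
import difflib
import re


def word_diff(before, after):
    """Return (added, removed) word lists between two strings.

    Hypothetical helper: tokenizes on runs of word characters, so
    whitespace and punctuation are ignored.
    """
    before_words = re.findall(r"\w+", before)
    after_words = re.findall(r"\w+", after)

    # autojunk=False turns off the heuristic that treats very frequent
    # items as junk in long sequences.
    matcher = difflib.SequenceMatcher(
        None, before_words, after_words, autojunk=False
    )

    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'replace' contributes to both lists; 'equal' to neither.
        if tag in ("replace", "delete"):
            removed.extend(before_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after_words[j1:j2])
    return added, removed


added, removed = word_diff("hello my name is", "hello my guys is")
print(added)    # ['guys']
print(removed)  # ['name']
```

Because the comparison is positional, a word that merely moves to a different place in the sentence may be reported as both removed and added.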

Answer 1 (score: 0)

before = "hello my name is"
after = "hello my  guy is test"


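# split() with no argument collapses runs of whitespace, so the
# double space in `after` does not produce an empty "word".
before = before.split()
after = after.split()

# Print every word of the new text that is absent from the old text.
for item in after:
    if item not in before:
        print(item)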
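
The loop above reports only the words that were added. A sketch that covers both directions with set differences could look like the following. The function name added_and_removed_sets is hypothetical, and the approach deliberately ignores word order and repeated words.

```python
def added_and_removed_sets(before, after):
    """Hypothetical helper: compare the sets of distinct words."""
    before_words = set(before.split())
    after_words = set(after.split())

    # Set difference keeps the words present on one side only;
    # sorted() makes the result order deterministic.
    added = sorted(after_words - before_words)
    removed = sorted(before_words - after_words)
    return added, removed


added, removed = added_and_removed_sets(
    "hello my name is", "hello my  guy is test"
)
print(added)    # ['guy', 'test']
print(removed)  # ['name']
```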

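Answer 2 (score: 0)

This isn't particularly pretty, but it seems to work for most cases I can think of. I'm sure it could be tidied up a lot, and it would be easy to make it case-insensitive.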

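def freqs(words):
    """Count how many times each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def added_and_removed(a, b):
    af = freqs(a.split())
    bf = freqs(b.split())

    removed = []
    added = []

    # Words that vanished entirely: report every occurrence.
    for key in af:
        if key not in bf:
            removed.extend([key] * af[key])

    for key in bf:
        num = af.get(key)
        if num is None:
            # Brand-new word: report every occurrence.
            added.extend([key] * bf[key])
        elif num > bf[key]:
            # Word kept, but with fewer occurrences than before.
            removed.extend([key] * (num - bf[key]))
        elif num < bf[key]:
            # Word kept, but with more occurrences than before.
            added.extend([key] * (bf[key] - num))

    return added, removed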

a = 'hello hello hello my name is Dave dave bar foo'
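b = 'hello my guys is test easy rob dave beef foo'

added, removed = added_and_removed(a, b)
print(added)
print(removed)

gives (on Python 3.7+, where dicts preserve insertion order; older versions may order the words differently)

['guys', 'test', 'easy', 'rob', 'beef']
['name', 'Dave', 'bar', 'hello', 'hello']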
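
As a possible tidy-up of the frequency-counting idea in this answer, collections.Counter from the standard library does the counting, and Counter subtraction keeps only positive differences. This is a sketch under assumptions, not the answerer's code: the function name added_and_removed_counts and the ignore_case parameter are hypothetical.

```python
from collections import Counter


def added_and_removed_counts(a, b, ignore_case=False):
    """Hypothetical helper: word-level diff that respects multiplicity."""
    if ignore_case:
        a, b = a.lower(), b.lower()

    before = Counter(a.split())
    after = Counter(b.split())

    # Counter subtraction drops zero and negative counts, so each
    # result holds only the surplus occurrences on one side.
    added = list((after - before).elements())
    removed = list((before - after).elements())
    return added, removed


a = 'hello hello hello my name is Dave dave bar foo'
b = 'hello my guys is test easy rob dave beef foo'

added, removed = added_and_removed_counts(a, b)
print(sorted(added))    # ['beef', 'easy', 'guys', 'rob', 'test']
print(sorted(removed))  # ['Dave', 'bar', 'hello', 'hello', 'name']
```

With ignore_case=True, 'Dave' and 'dave' count as the same word, so one 'dave' is reported as removed instead of 'Dave'.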