Question

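I'm working with a database of names that may contain duplicate entries, and I'm trying to identify which names we have two of. Unfortunately the formatting is less than ideal: some entries have the first name, middle name, and last name or maiden name mashed into a single string, and some have only the first and last name.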

I need a way to see whether 'John Marvulli' matches 'John Michael Marvulli' and to be able to act on those matches. However, if you try:

>>> 'John Marvulli' in 'John Michael Marvulli'
False

it returns False. Is there a simple way to compare two strings like this, to see whether one name is contained in the other?

Answer 1

You need to split the strings and look for the individual words:

>>> all(x in 'John Michael Marvulli'.split() for x in 'John Marvulli'.split())
True
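
As a rough sketch of how this check might be wrapped for reuse: the function name `name_contained` below is hypothetical, and comparing sets of lowercased words is an assumption that ignores case and word order, which may or may not suit your data.

```python
def name_contained(short_name, long_name):
    """Return True if every word of short_name also appears in long_name."""
    # Lowercase and split on whitespace so case and extra spaces don't matter.
    short_words = set(short_name.lower().split())
    long_words = set(long_name.lower().split())
    # Subset test: all words of the shorter name occur in the longer one.
    return short_words <= long_words


print(name_contained('John Marvulli', 'John Michael Marvulli'))  # True
print(name_contained('John Marvulli', 'Jane Michael Marvulli'))  # False
```

Note that this treats 'Marvulli John' and 'John Marvulli' as equivalent, so it is looser than the one-liner only with respect to case.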

Answer 2

I recently discovered the power of the difflib module, and I think it will help you:

import difflib

datab = ['Pnk Flooyd', 'John Marvulli',
         'Ld Zeppelin', 'John Michael Marvulli',
         'Led Zepelin', 'Beetles', 'Pink Fl',
         'Beatlez', 'Beatles', 'Poonk LLoyds',
         'Pook Loyds']
print(datab)
print()


li = []  # groups of similar names found so far
s = difflib.SequenceMatcher()

def yield_ratios(s, iterable):
    # Compare each already-grouped name (seq1) against the current seq2.
    for x in iterable:
        s.set_seq1(x)
        yield s.ratio()

for text_item in datab:
    s.set_seq2(text_item)
    for gathered in li:
        # Join the first group that contains a sufficiently similar name.
        if any(r > 0.45 for r in yield_ratios(s, gathered)):
            gathered.append(text_item)
            break
    else:
        # No group matched: start a new one.
        li.append([text_item])


for el in li:
    print(el)

Result

['Pnk Flooyd', 'Pink Fl', 'Poonk LLoyds', 'Pook Loyds']
['John Marvulli', 'John Michael Marvulli']
['Ld Zeppelin', 'Led Zepelin']
['Beetles', 'Beatlez', 'Beatles']
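
If you only need the candidates for one particular name rather than a full grouping, `difflib.get_close_matches` may be enough. This is a minimal sketch; the `cutoff=0.6` value is the library default and is only an assumption about what suits this data, and the short `names` list is made up for illustration.

```python
import difflib

# Illustrative data only, not taken from a real database.
names = ['Pnk Flooyd', 'John Marvulli', 'Ld Zeppelin',
         'John Michael Marvulli', 'Beatles']

# Similarity ratio between the two spellings of the same person.
ratio = difflib.SequenceMatcher(
    None, 'John Marvulli', 'John Michael Marvulli').ratio()
print(round(ratio, 3))

# Up to 3 entries whose ratio against the query is at least the cutoff.
matches = difflib.get_close_matches('John Marvulli', names, n=3, cutoff=0.6)
print(matches)
```

The threshold is worth tuning on real data: set too low, unrelated names start to cluster together.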

Answer 3

import re

# Extra whitespace between the names should still match.
n1 = "john   Miller"

n2 = "johnas Miller"

n3 = "john doe Miller"
n4 = "john doe paul Miller"


# "john", then any number of middle names, then "Miller".
regex = r"john\s+(\w+\s+)*Miller"
compiled = re.compile(regex)

# Each line prints True when the name does NOT match.
print(compiled.search(n1) is None)
print(compiled.search(n2) is None)
print(compiled.search(n3) is None)
print(compiled.search(n4) is None)

'''
output:


False
True
False
False
'''
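
The pattern above hard-codes the two names. One possible generalization is sketched below; the helper name `build_name_pattern` is hypothetical, and the case-insensitive matching and word-boundary anchors are assumptions rather than part of the original answer.

```python
import re


def build_name_pattern(first, last):
    """Compile a pattern matching first and last name with optional middle names."""
    # re.escape guards against regex metacharacters inside the names.
    return re.compile(
        r"\b" + re.escape(first) + r"\s+(?:\w+\s+)*" + re.escape(last) + r"\b",
        re.IGNORECASE,
    )


pattern = build_name_pattern("John", "Marvulli")
print(bool(pattern.search("John Michael Marvulli")))  # True
print(bool(pattern.search("Johnas Marvulli")))        # False
```

Unlike the word-splitting approach, this keeps the word order fixed: the first name must come before the last name.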

Python "in" comparison of strings with different word lengths
