How to find matches for a string against multiple lists of strings

Asked: 2018-05-04 10:26:14

Tags: python string-matching

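I have a set of strings, and I want to find which of them an input string matches. Here is the scenario: I have a predefined list of strings such as [Intel, Windows, Google], and the input strings look like this: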

'Intel(R) software'

'Intel IT'

'IntelliCAD Technology Consortium'

'Huaian Ningda intelligence Project co.,Ltd'

'Intellon Corporation'

'INTEL\Giovanni'

'Internal - Intel® Identity Protection Technology Software'


'*.google.com'

'GoogleHit'

'http://www.google.com'

'Google Play - Olmsted County'

'Microsoft Windows Component Publisher'

'Microsoft Windows 2000 Publisher'

'Microsoft Windows XP Publisher'

'Windows Embedded Signer'

'Windows Corporation'

'Windows7-PC\Windows7'

Can anyone suggest an ML algorithm, or some other approach, to achieve the highest possible match percentage? The preferred language is Python.

2 answers:

Answer 0: (score: 0)

You can use difflib:

import difflib

a = ['apple', 'ball', 'pen']
b = ['appel', 'blla', 'epn']

[(i, difflib.get_close_matches(i, a)[0]) for i in b]

Output:

[('appel', 'apple'), ('blla', 'ball'), ('epn', 'pen')]
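One caveat with the list comprehension above: `get_close_matches` returns an empty list when no candidate reaches the cutoff (0.6 by default), so indexing with `[0]` raises `IndexError` for an unmatched word. A minimal sketch of a guarded version follows; the helper name `best_match` and the extra sample word `'xyz'` are illustrative assumptions, not part of the original answer.

```python
import difflib

def best_match(word, candidates, cutoff=0.6):
    """Return the closest candidate, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

a = ['apple', 'ball', 'pen']
b = ['appel', 'blla', 'epn', 'xyz']  # 'xyz' has no close match in a

pairs = [(word, best_match(word, a)) for word in b]
print(pairs)
```

Lowering `cutoff` makes the match more permissive, at the price of more false positives.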

To get a similarity ratio (a float between 0 and 1), you can use SequenceMatcher.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

For example:

>>> similar("Apple","Appel")
0.8
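Comparing a whole input such as 'Intel(R) software' with 'Intel' gives a low ratio, because most of the input is unrelated text. One possible adaptation to the question's data, sketched here under assumptions, is to split each input into tokens and keep the best ratio against any keyword, ignoring case. The function name `best_keyword` and the tokenization rule (split on anything that is not an ASCII letter or digit) are choices made for this sketch, not something the answer prescribes.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = ['Intel', 'Windows', 'Google']

def best_keyword(text, keywords=KEYWORDS):
    """Return (keyword, score) for the keyword closest to any token in text."""
    # Split on anything that is not an ASCII letter or digit; compare in lowercase.
    tokens = [t for t in re.split(r'[^A-Za-z0-9]+', text.lower()) if t]
    best = (None, 0.0)
    for keyword in keywords:
        for token in tokens:
            score = SequenceMatcher(None, keyword.lower(), token).ratio()
            if score > best[1]:
                best = (keyword, score)
    return best

for s in ['Intel IT', 'INTEL\\Giovanni', '*.google.com', 'Intellon Corporation']:
    print(s, best_keyword(s))
```

Exact tokens such as 'Intel' in 'Intel IT' score 1.0, while near misses such as 'Intellon' score lower, so a threshold on the score can separate the two cases.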

Answer 1: (score: 0)

Use the re module:

import re

# input strings to score; backslashes are escaped so they are taken literally
love = ['Intel(R) software',
        'Intel IT',
        'IntelliCAD Technology Consortium',
        'Huaian Ningda intelligence Project co.,Ltd',
        'Intellon Corporation',
        'INTEL\\Giovanni',
        'Internal - Intel® Identity Protection Technology Software',
        '*.google.com',
        'GoogleHit',
        'http://www.google.com',
        'Google Play - Olmsted County',
        'Microsoft Windows Component Publisher',
        'Microsoft Windows 2000 Publisher',
        'Microsoft Windows XP Publisher',
        'Windows Embedded Signer',
        'Windows Corporation',
        'Windows7-PC\\Windows7']

match = {}
counts = {}

regex_words = ['Intel', 'Windows', 'Google']
no = 0

# for each of the predefined words
for x in regex_words:
    # new regex we will use for a closer match
    regex = r'\s?' + x + r'\s'

    # items we want to match
    for each in love:
        found = re.findall(x, each)
        if found:

            # counting them to get the maximum, (ran out of time)
            counts[no] = len(found)

            # here is a closer match: the word is followed by whitespace
            if re.findall(regex, each):
                per = 0.5
                match[each] = str(per)

            # this is an exact match
            elif each == x:
                per = 0.75
                match[each] = str(per)

            # this is the very first match the ordinary
            else:
                per = 0.25
                match[each] = str(per)

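            # advance the index only for matched items so counts stays aligned with match
            no += 1

""" This computes the part of the score that reflects how often
the word repeats in an item, relative to the highest repeat count """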

# this will be the maximum of the counts
highest = 0

# start working on the counts
for y in counts:

    # if this is higher than whats already in the highest
    if counts[y] > highest:

        # make it the highest
        highest = counts[y]

# index for counts dict
small_no = 0
for z in match:

    # percentage of what was in the counts for the item compared to the highest
    per = counts[small_no] / highest * 100

    # percentage the item gets for the remaining 25 score allocated to all
    score = per / 100 * 25
    total_score = round((score / 100), 2) 

    # increment the no. that we are using to iterate the counts
    small_no += 1

    # update the score stored for the match
    match[z] = str(float(match[z]) + total_score)

This will output:

{'Intel(R) software': '0.37', 'Intel IT': '0.62', 'IntelliCAD Technology Consortium': '0.37', 'Intellon Corporation': '0.37', 'Internal - Intel® Identity Protection Technology Software': '0.37', 'Microsoft Windows Component Publisher': '0.62', 'Microsoft Windows 2000 Publisher': '0.62', 'Microsoft Windows XP Publisher': '0.62', 'Windows Embedded Signer': '0.62', 'Windows Corporation': '0.62', 'Windows7-PC\\Windows7': '0.5', 'GoogleHit': '0.37', 'Google Play - Olmsted County': '0.62'}
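The patterns above are case-sensitive, so inputs such as 'INTEL\Giovanni', '*.google.com' and 'http://www.google.com' do not appear in the result. A minimal sketch of one way to catch them, assuming case should be ignored: use `re.IGNORECASE`, and use `\b` word boundaries to tell a whole-word hit from a hit inside a longer word. The function name `classify` and the labels 'word' and 'partial' are hypothetical names chosen for this sketch.

```python
import re

KEYWORDS = ['Intel', 'Windows', 'Google']

def classify(text, keywords=KEYWORDS):
    """Map each keyword found in text to 'word' (whole-word hit) or 'partial'."""
    hits = {}
    for keyword in keywords:
        escaped = re.escape(keyword)  # treat the keyword as literal text
        if re.search(r'\b' + escaped + r'\b', text, re.IGNORECASE):
            hits[keyword] = 'word'
        elif re.search(escaped, text, re.IGNORECASE):
            hits[keyword] = 'partial'
    return hits

print(classify('INTEL\\Giovanni'))
print(classify('IntelliCAD Technology Consortium'))
print(classify('http://www.google.com'))
```

The two labels could then be mapped to weights in the same spirit as the 0.5 and 0.25 used in the answer.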