Question

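I have a list of SKU names, and I need to resolve the abbreviations in them into full words.

The abbreviations vary in length (2-5 characters), but their letters appear in the same order as in the actual word.

A few examples:

SKU name: "235 DSKTP 10LB" ----> "Desktop"

SKU name: "222840 MSE 2oz" ----> "Mouse"

Additional notes:

The SKU names are not all uppercase, although I know that could easily be changed with the .upper() method.
The list of words I need to match is long (more than 100 words), so maybe building a list of words to match against some pattern would be most efficient?

I have played around with some regular expressions, but to no avail.

Is there a regex pattern along the lines of d?e?s?k?t?o?p? ?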

Answer 1

import re
from collections import OrderedDict

data = '''
235 DSKTP 10LB
222840 MSE 2oz
1234 WNE 1L
12345 XXX 23L
RND PTT GNCH 16 OZ 007349012845
FRN SHL CNCH 7.05 OZ 007473418910
TWST CLNT 16 OZ 00733544
'''

words = ['Desktop',
'Mouse',
'Tree',
'Wine',
'Gnocchi',
'Shells',
'Cellentani']

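def compare(sku_abbr, full_word):
    # Keep only the letters of full_word that also occur in sku_abbr
    # (those outside the symmetric difference of the two letter sets).
    s = ''.join(c for c in full_word if c not in set(sku_abbr) ^ set(full_word))
    # Drop repeated letters, keeping the first occurrence of each.
    s = ''.join(OrderedDict.fromkeys(s).keys())
    return s == sku_abbr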

for full_sku in data.splitlines():
    if not full_sku:
        continue
    for sku_abbr in re.findall(r'([A-Z]{3,})', full_sku):
        should_break = False
        for w in words:
            if compare(sku_abbr.upper(), w.upper()):
                print(full_sku, w)
                should_break = True
                break
        if should_break:
            break
    else:
        print(full_sku, '* NOT FOUND *')

Prints:

235 DSKTP 10LB Desktop
222840 MSE 2oz Mouse
1234 WNE 1L Wine
12345 XXX 23L * NOT FOUND *
RND PTT GNCH 16 OZ 007349012845 Gnocchi
FRN SHL CNCH 7.05 OZ 007473418910 Shells
TWST CLNT 16 OZ 00733544 Cellentani
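
As an alternative to the set-based compare above, the "same letters, same order" rule from the question can be checked directly with a regular expression. Note that a pattern like d?e?s?k?t?o?p? makes every letter optional, so it also matches the empty string; the sketch below goes the other way and builds the pattern from the abbreviation (DSKTP becomes D.*S.*K.*T.*P) and tries it against each full word. The helper names abbr_pattern and expand are made up for illustration, and the sketch assumes an abbreviation always starts with the first letter of its word.

```python
import re

def abbr_pattern(abbr):
    # "DSKTP" -> D.*S.*K.*T.*P, case-insensitive (hypothetical helper)
    return re.compile('.*'.join(re.escape(c) for c in abbr), re.IGNORECASE)

def expand(abbr, words):
    # match() anchors at the start of the word, so the first letters must agree;
    # the remaining letters only have to appear later, in the same order.
    pattern = abbr_pattern(abbr)
    return [w for w in words if pattern.match(w)]

words = ['Desktop', 'Mouse', 'Tree', 'Wine', 'Gnocchi', 'Shells', 'Cellentani']

for abbr in ['DSKTP', 'mse', 'CLNT', 'XXX']:
    print(abbr, expand(abbr, words))
```

Because expand returns every word that fits, ambiguous abbreviations show up as lists with more than one entry instead of silently picking the first hit.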

Answer 2

You can create a dictionary that maps each abbreviation to the actual word:

import re
names = ["235 DSKTP 10LB", "222840 MSE 2oz"]
abbrs = {'DSKTP':'Desktop', 'MSE':'Mouse'}
matched = [re.findall(r'(?<=\s)[a-zA-Z]+(?=\s)', i) for i in names]
result = ['N/A' if not i else abbrs.get(i[0], i[0]) for i in matched]

Output:

['Desktop', 'Mouse']
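
Since the question mentions that the SKU names are not all uppercase, the dictionary lookup can be made case-insensitive by normalizing each token before looking it up. The following is only a sketch under that assumption; the function name lookup is hypothetical, and unlike the snippet above it returns 'N/A' for unknown tokens instead of echoing the token back.

```python
import re

abbrs = {'DSKTP': 'Desktop', 'MSE': 'Mouse'}

def lookup(name, table=abbrs):
    # Try every run of letters in the SKU name, upper-cased to match the keys.
    for token in re.findall(r'[A-Za-z]+', name):
        word = table.get(token.upper())
        if word is not None:
            return word
    return 'N/A'

names = ["235 dsktp 10LB", "222840 Mse 2oz", "12345 XXX 23L"]
print([lookup(n) for n in names])
```

Scanning all letter runs also covers names where the abbreviation is the first or last token, which the whitespace lookbehind and lookahead would skip.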

Answer 3

Look into the Levenshtein distance, which measures the "similarity of texts".

Source of the Levenshtein implementation: https://en.wikibooks.org/wiki/Algorithm_Implementation

def levenshtein(s1, s2):
    # source: https://en.wikibooks.org/wiki/Algorithm_Implementation
    #               /Strings/Levenshtein_distance#Python
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1  
            deletions = current_row[j] + 1        
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append( min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

Applied to your problem:

skus = ["235 DSKTP 10LB","222840 MSE 2oz"]
full = ["Desktop", "Mouse", "potkseD"]

# go over all skus
for sku in skus:
    name = sku.split()[1].lower()       # extract name
    dist = []
    for f in full:                      # calculate all levenshtein dists to full names
                                        # you could shorten this by only using those
                                        # where the 1st character is identical
        dist.append((levenshtein(name, f.lower()), name, f))

    print(dist)

    # get the minimal distance (beware of ties: several words can share a distance)
    print(min(dist, key=lambda x: x[0]))

Output:

# distances 
[(2, 'dsktp', 'Desktop'), (5, 'dsktp', 'Mouse'), (6, 'dsktp', 'potkseD')]

# minimal one
(2, 'dsktp', 'Desktop')

# distances
[(6, 'mse', 'Desktop'), (2, 'mse', 'Mouse'), (5, 'mse', 'potkseD')]

# minimal one
(2, 'mse', 'Mouse')

If your mappings are fixed, sit down and build the mapping dictionary by hand once, and rely on it until a better idea comes along.
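
If maintaining a hand-written distance function is not appealing, the standard library's difflib offers a comparable fuzzy lookup. It ranks candidates by a similarity ratio rather than by Levenshtein distance, so this is a swapped-in technique and not the same measure as above. A minimal sketch, with closest_word as a hypothetical helper name and the default cutoff of 0.6 as an assumption that would need tuning on real SKU data:

```python
import difflib

def closest_word(abbr, words, cutoff=0.6):
    """Return the word most similar to abbr, or None if nothing passes cutoff."""
    lowered = {w.lower(): w for w in words}   # compare case-insensitively
    hits = difflib.get_close_matches(abbr.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

full = ["Desktop", "Mouse", "potkseD"]

for sku in ["235 DSKTP 10LB", "222840 MSE 2oz"]:
    abbr = sku.split()[1]                     # assumes the name is the 2nd token
    print(sku, "->", closest_word(abbr, full))
```

The cutoff gives a natural way to report "no match" for tokens that resemble none of the known words.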
