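I have a list of SKU names and need to resolve the abbreviations in them into full words.
The abbreviations vary in length (2-5 characters), but their letters appear in the same order as in the actual word.
A few examples:
SKU name: "235 DSKTP 10LB" ----> "Desktop"
SKU name: "222840 MSE 2oz" ----> "Mouse"
Additional notes:
I have played around with some regular expressions, to no avail.
Is there a regex pattern along the lines of d?e?s?k?t?o?p? for this?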
Answer 0 (score: 0)
import re
from collections import OrderedDict
data = '''
235 DSKTP 10LB
222840 MSE 2oz
1234 WNE 1L
12345 XXX 23L
RND PTT GNCH 16 OZ 007349012845
FRN SHL CNCH 7.05 OZ 007473418910
TWST CLNT 16 OZ 00733544
'''
words = ['Desktop',
'Mouse',
'Tree',
'Wine',
'Gnocchi',
'Shells',
'Cellentani']
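
def compare(sku_abbr, full_word):
    # keep only the letters of the full word that also occur in the abbreviation
    s = ''.join(c for c in full_word if c not in set(sku_abbr) ^ set(full_word))
    # drop repeated letters while preserving their order
    s = ''.join(OrderedDict.fromkeys(s).keys())
    return s == sku_abbr

for full_sku in data.splitlines():
    if not full_sku:
        continue
    # try every run of three or more capital letters in the SKU
    for sku_abbr in re.findall(r'([A-Z]{3,})', full_sku):
        should_break = False
        for w in words:
            if compare(sku_abbr.upper(), w.upper()):
                print(full_sku, w)
                should_break = True
                break
        if should_break:
            break
    else:
        # no candidate abbreviation in this SKU matched any word
        print(full_sku, '* NOT FOUND *')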
Prints:
235 DSKTP 10LB Desktop
222840 MSE 2oz Mouse
1234 WNE 1L Wine
12345 XXX 23L * NOT FOUND *
RND PTT GNCH 16 OZ 007349012845 Gnocchi
FRN SHL CNCH 7.05 OZ 007473418910 Shells
TWST CLNT 16 OZ 00733544 Cellentani
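The d?e?s?k?t?o?p? idea from the question can also be used directly: build the pattern from each full word and require the abbreviation to match it completely. The following is only a minimal sketch; the helper names `optional_letter_pattern` and `is_abbreviation_of` are made up for illustration, and the word list is borrowed from the example above.

```python
import re

def optional_letter_pattern(word):
    # "Desktop" -> "D?e?s?k?t?o?p?": every letter of the word is optional
    return re.compile(''.join(re.escape(c) + '?' for c in word), re.IGNORECASE)

def is_abbreviation_of(abbr, word):
    # fullmatch() forces the whole abbreviation to be consumed by the pattern.
    # The all-optional pattern also matches '', so empty input is rejected.
    return bool(abbr) and optional_letter_pattern(word).fullmatch(abbr) is not None

words = ['Desktop', 'Mouse', 'Wine', 'Gnocchi']
for abbr in ['DSKTP', 'MSE', 'WNE', 'XXX']:
    # list every word whose letters contain the abbreviation in order
    print(abbr, [w for w in words if is_abbreviation_of(abbr, w)])
```

Nothing is de-duplicated here, so abbreviations with repeated letters work as well. The flip side is that a short abbreviation may match several words, so you probably still need a rule for picking one.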
Answer 1 (score: 0)
You can create a dictionary that maps each abbreviation to the actual word:
import re
names = ["235 DSKTP 10LB", "222840 MSE 2oz"]
abbrs = {'DSKTP':'Desktop', 'MSE':'Mouse'}
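# raw string, so that \s is passed to the regex engine unchanged
matched = [re.findall(r'(?<=\s)[a-zA-Z]+(?=\s)', i) for i in names]
# fall back to the raw token if the abbreviation is not in the dictionary
result = ['N/A' if not i else abbrs.get(i[0], i[0]) for i in matched]
print(result)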
Output:
['Desktop', 'Mouse']
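The lookup above only uses the first alphabetic token of each name. For SKUs with several tokens, a minimal sketch could check every whitespace-separated token against the dictionary instead. The helper name `expand_sku` is hypothetical, and the 'GNCH' entry is added only to illustrate a multi-token SKU.

```python
def expand_sku(name, abbrs):
    """Return the full words for every token of `name` found in `abbrs`."""
    # str.split() without arguments splits on any run of whitespace
    tokens = [token.upper() for token in name.split()]
    return [abbrs[token] for token in tokens if token in abbrs]

abbrs = {'DSKTP': 'Desktop', 'MSE': 'Mouse', 'GNCH': 'Gnocchi'}
for name in ["235 DSKTP 10LB", "RND PTT GNCH 16 OZ 007349012845", "12345 XXX 23L"]:
    print(name, '->', expand_sku(name, abbrs))
```

An empty list means that no token of the SKU is in the dictionary yet.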
Answer 2 (score: 0)
Look into the Levenshtein distance, which measures how similar two texts are.
Levenshtein implementation, source: https://en.wikibooks.org/wiki/Algorithm_Implementation
def levenshtein(s1, s2):
    # source: https://en.wikibooks.org/wiki/Algorithm_Implementation
    #         /Strings/Levenshtein_distance#Python
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]
Applied to your problem:
skus = ["235 DSKTP 10LB","222840 MSE 2oz"]
full = ["Desktop", "Mouse", "potkseD"]
# go over all skus
for sku in skus:
    name = sku.split()[1].lower()  # extract name
    dist = []
    for f in full:  # calculate all levenshtein dists to full names
        # you could shorten this by only using those
        # where the 1st character is identical
        dist.append((levenshtein(name.lower(), f.lower()), name, f))
    print(dist)
    # get the minimal distance (beware if the same distance occurs more than once)
    print(min((p for p in dist), key=lambda x: x[0]))
Output:
# distances
[(2, 'dsktp', 'Desktop'), (5, 'dsktp', 'Mouse'), (6, 'dsktp', 'potkseD')]
# minimal one
(2, 'dsktp', 'Desktop')
# distances
[(6, 'mse', 'Desktop'), (2, 'mse', 'Mouse'), (5, 'mse', 'potkseD')]
# minimal one
(2, 'mse', 'Mouse')
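If you would rather not carry your own distance function, the standard library's difflib offers a comparable fuzzy lookup. Note that this swaps in a different measure: difflib scores similarity as a ratio based on matching blocks, not as a Levenshtein distance. The sketch below is only an illustration; the helper name `closest_word` is made up, and the 0.6 cutoff is simply the default of difflib.get_close_matches.

```python
import difflib

def closest_word(abbr, candidates, cutoff=0.6):
    # compare case-insensitively, but return the candidate in its original spelling
    by_lower = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(abbr.lower(), list(by_lower), n=1, cutoff=cutoff)
    # None signals that no candidate reached the cutoff
    return by_lower[hits[0]] if hits else None

full = ["Desktop", "Mouse", "potkseD"]
for sku in ["235 DSKTP 10LB", "222840 MSE 2oz"]:
    print(sku, '->', closest_word(sku.split()[1], full))
```

The cutoff gives you an explicit "no match" result instead of always returning the least bad candidate, which may help with tokens that are not abbreviations at all.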
If the mapping is fixed, sit down and build the mapping dictionary by hand once; that will do until a better idea comes along.