Question

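I am working with tabular data (using pandas, of course). I have the following problem: I want to associate certain products with labels, for example "Paprika 100g" or "Diet Coke 2L". However, some items do not follow the standard, for example "2 liter bottle coke diet" instead of "Coke 2L". I would like a function that reads a string such as "2 liter bottle coke diet", checks a list of standard labels, and classifies the string with the correct standard label. I tried SequenceMatcher from difflib, but it only gets about 3/4 of my products right. Is there a better "pythonic" solution?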
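
For reference, a difflib-based attempt of the kind described above might look roughly like the sketch below. The helper name closest_label and the two-item label list are assumptions made for illustration, not the actual code or data behind the 3/4 figure.

```python
from difflib import SequenceMatcher


def closest_label(text, labels):
    """Return (label, ratio) for the label most similar to text.

    Hypothetical helper: compares lowercased strings character by
    character with SequenceMatcher and keeps the highest ratio.
    """
    def ratio(label):
        return SequenceMatcher(None, text.lower(), label.lower()).ratio()

    best = max(labels, key=ratio)
    return best, ratio(best)


# Illustrative labels taken from the examples in the question
standard_labels = ["Paprika 100g", "Diet Coke 2L"]
print(closest_label("2 liter bottle coke diet", standard_labels))
```

SequenceMatcher looks for long runs of characters in the same order, so strings whose words are reordered tend to score lower than one would like, which may explain part of the misses.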

Answer 1

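I think the problem here is less about the language and more about string distance metrics and tokenization. For example, if the label reads "Diet Coke 2L", do you match it against the one-token string "Coke" or the two-token string "Diet Coke"? Assuming you have settled on the number of tokens to match, I suggest using the jellyfish library with a distance metric such as Levenshtein distance.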

As a code example:

from jellyfish import levenshtein_distance

label = "Diet Coke 2 Liter"
match_labels = ["Sprite", "Coke", "Pepsi"]

# Split the string into single-word tokens
label_split = label.split()

# Tolerance for matches
match_tol = 1  # Match if at most one letter is different

# Loop through each word; stop at the first label that matches it
match_tuple = []
for word in label_split:
    for match in match_labels:
        if levenshtein_distance(word, match) <= match_tol:
            match_tuple.append((label, word, match))
            break
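
The one-token versus two-token question can be sidestepped by comparing each label against every run of consecutive tokens of the same length. Below is a minimal sketch of that idea. It uses a plain-Python edit distance as a stand-in for jellyfish's levenshtein_distance so that it runs without dependencies; the names token_windows and best_label, and the rule of preferring longer labels on ties, are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def token_windows(text, size):
    """All runs of `size` consecutive whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def best_label(text, labels, tol=1):
    """Return the label found in text within `tol` edits, or None.

    Ties on distance go to the label with more tokens (an assumption),
    so "Diet Coke" wins over "Coke" when both match equally well.
    """
    best = None
    for label in labels:
        size = len(label.split())
        for window in token_windows(text, size):
            distance = levenshtein(window, label.lower())
            candidate = (distance, -size, label)
            if distance <= tol and (best is None or candidate[:2] < best[:2]):
                best = candidate
    return best[2] if best else None


print(best_label("Diet Cokee 2 Liter", ["Sprite", "Coke", "Pepsi", "Diet Coke"]))
```

Lowercasing both sides is a design choice here; with a tolerance of one edit, case differences alone would otherwise use up the whole budget.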

Answer 2

It turns out the best solution I found was to use this small bit of machine learning:

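from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# x is the string we are trying to classify as one of the labels y, z or w
# (all four names are placeholders for your own strings).
corpus = [x, y, z, w]

vectorizer = CountVectorizer()
# Keep the sparse matrix: recent scikit-learn versions reject the
# np.matrix object that .todense() returns.
features = vectorizer.fit_transform(corpus)

# Distance from x (row 0) to every row of the corpus, itself included
for i in range(features.shape[0]):
    print(euclidean_distances(features[0], features[i]))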

We then pick the smallest distance to get the best label. For my problem, a list of 8,256 strings, the hit rate was close to 100%.
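
The "pick the smallest distance" step can be sketched without scikit-learn as follows. This is a dependency-free approximation, not the code used for the result above: bag_of_words mimics CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), and the function names and the three example labels are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; tokens are runs of 2+ word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def euclidean(a, b):
    """Euclidean distance between two word-count vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))


def nearest_label(text, labels):
    """Return the label whose word counts are closest to those of text."""
    query = bag_of_words(text)
    return min(labels, key=lambda label: euclidean(query, bag_of_words(label)))


# Illustrative labels, not the real data set
labels = ["Diet Coke 2L", "Paprika 100g", "Sprite 2L"]
print(nearest_label("2L bottle coke diet", labels))
```

Because the comparison uses word counts rather than character order, reordered words cost nothing, which is likely why this works better than SequenceMatcher on strings like the one in the question. Note that single-character tokens such as "2" are dropped by this tokenization.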

Checking similarity between strings
