What is the best way to group similar words in Python?

Time: 2015-10-01 10:09:59

Tags: python algorithm numpy data-mining

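I am trying to group a list of words by similarity. I found this interesting issue related to the topic and tried to apply the Affinity Propagation algorithm to my array of words, but I find the output quite poor.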

How can I improve this algorithm?

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools

words = np.array(['ice cubes',
 'club soda',
 'white rum',
 'lime',
 'turbinado',
 'egg',
 'hearts of palm',
 'cilantro',
 'coconut cream',
 'flax seed meal',
 'kosher salt',
 'jalapeno chilies',
 'garlic',
 'cream cheese soften',
 'coconut oil',
 'lime juice',
 'crushed red pepper flakes',
 'ground coriander',
 'pepper',
 'chicken breasts',
 'coconut flour',
 'onion',
 'sweetened condensed milk',
 'butter',
 'cocoa powder',
 'lime',
 'crushed ice',
 'simple syrup',
 'cachaca',
 'sugar',
 'corn starch',
 'egg whites',
 'boiling water',
 'cold water',
 'egg yolks',
 'sweetened condensed milk',
 'milk',
 'jell-o gelatin dessert',
 'olive oil',
 'low sodium chicken broth',
 'cilantro leaves',
 'chile powder',
 'fresh thyme',
 'chile pepper',
 'sweet paprika',
 'sablefish',
 'brown rice',
 'yellow onion',
 'low-fat coconut milk',
 'roma tomatoes',
 'garlic',
 'fresh lime juice',
 'egg',
 'grating cheese',
 'milk',
 'tapioca flour',
 'salt',
 'olive oil',
 'coconut milk',
 'frozen banana',
 'pure acai puree',
 'almond butter',
 'kosher salt',
 'dijon mustard',
 'sweet paprika',
 'boneless skinless chicken breast halves',
 'caraway seeds',
 'ground black pepper',
 'lime wedges',
 'chopped cilantro',
 'lager beer',
 'peeled fresh ginger',
 'garlic cloves',
 'green bell pepper',
 'unsalted butter',
 'vegetable oil',
 'onion',
 'egg',
 'whole milk',
 'extra-virgin olive oil',
 'garlic cloves',
 'corn kernels',
 'chicken breasts',
 'all-purpose flour',
 'cream cheese soften',
 'celery ribs'])

print("calculating distances...")

(dim,) = words.shape

f = lambda x_y: -leven.distance(x_y[0],x_y[1])

res=np.fromiter(map(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))

af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

# Distances had to be converted to similarities; I did that by negating the distance.

unique_labels = set(labels)
for i in unique_labels:
    print(words[labels==i])

Here is the output I get:

calculating distances...
['lime' 'lime']
['egg' 'egg' 'egg']
['cream cheese soften' 'cream cheese soften']
['sweetened condensed milk' 'sweetened condensed milk']
['milk' 'milk']
['olive oil']
['turbinado' 'hearts of palm' 'pepper' 'crushed ice' 'sugar' 'egg whites'
 'boiling water' 'egg yolks' 'jell-o gelatin dessert' 'cilantro leaves'
 'chile powder' 'fresh thyme' 'chile pepper' 'sablefish' 'brown rice'
 'low-fat coconut milk' 'fresh lime juice' 'grating cheese' 'tapioca flour'
 'dijon mustard' 'caraway seeds' 'lime wedges' 'lager beer'
 'peeled fresh ginger' 'green bell pepper' 'vegetable oil'
 'all-purpose flour']
['garlic' 'garlic']
['olive oil']
['kosher salt' 'kosher salt']
['sweet paprika' 'sweet paprika']
['garlic cloves' 'garlic cloves']
['ice cubes' 'club soda' 'white rum' 'cilantro' 'coconut cream'
 'flax seed meal' 'jalapeno chilies' 'coconut oil' 'lime juice'
 'crushed red pepper flakes' 'ground coriander' 'coconut flour' 'onion'
 'butter' 'cocoa powder' 'simple syrup' 'cachaca' 'corn starch'
 'cold water' 'low sodium chicken broth' 'yellow onion' 'roma tomatoes'
 'salt' 'coconut milk' 'frozen banana' 'pure acai puree' 'almond butter'
 'boneless skinless chicken breast halves' 'ground black pepper'
 'chopped cilantro' 'unsalted butter' 'onion' 'whole milk'
 'extra-virgin olive oil' 'corn kernels' 'celery ribs']
['chicken breasts' 'chicken breasts']

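As you can see, the grouping is not very good. I would expect, for example, all the entries containing the word 'salt' to be grouped together. Likewise, I would expect 'fresh lime juice' to be grouped with 'lime' or 'lime wedges'.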

Thanks

3 answers:

Answer 0 (score: 1)

I think your "words" are not words but sentences. You could first look at their parts (the individual words) and do a pre-grouping.
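One way to read this suggestion is to split every entry into words and collect the entries that share a word. The sketch below is only an illustration under that assumption: the helper name `pregroup_by_token` and the short `sample` list (taken from the question's data) are hypothetical, and whitespace splitting is the simplest possible tokenizer.

```python
from collections import defaultdict

def pregroup_by_token(phrases):
    """Map each word to the sorted list of distinct phrases that contain it."""
    groups = defaultdict(set)
    for phrase in phrases:
        for token in phrase.split():
            groups[token].add(phrase)
    # Keep only words that are shared by at least two distinct phrases.
    return {tok: sorted(members) for tok, members in groups.items()
            if len(members) > 1}

# A few entries from the question's list (hypothetical sample).
sample = ['kosher salt', 'salt', 'lime', 'lime juice',
          'fresh lime juice', 'lime wedges', 'butter']

groups = pregroup_by_token(sample)
for token, members in sorted(groups.items()):
    print(token, members)
```

With this kind of pre-grouping, 'salt' and 'kosher salt' land in the same bucket before any edit distance is computed; a character-level distance could then be applied inside each bucket, or to the entries left over.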

Answer 1 (score: 1)

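You are grouping sentences, not words. Depending on which way you want to bias the result, you could split each sentence into words and compute the score between two sentences as: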

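  1. the minimum Levenshtein distance between any pair of words
  2. or, for each word in the shorter sentence, find the minimum Levenshtein distance to the words of the longer sentence; the score is then:

    1. the second-smallest distance
    2. the median distance
    3. the mean distance
    4. or one of many other possibilities.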
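The options in the list above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: `levenshtein` is a plain dynamic-programming stand-in for `Levenshtein.distance` so the block runs without extra packages, the helper names `min_pair_distance` and `best_match_distances` are hypothetical, and both helpers assume non-empty sentences.

```python
from statistics import mean, median

def levenshtein(a, b):
    """Character-level edit distance (stand-in for Levenshtein.distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def min_pair_distance(s1, s2):
    """Option 1: smallest distance between any word of s1 and any word of s2."""
    return min(levenshtein(w1, w2) for w1 in s1.split() for w2 in s2.split())

def best_match_distances(s1, s2):
    """Option 2: for each word of the shorter sentence, the distance to its
    closest word in the longer sentence."""
    short, long_ = sorted((s1.split(), s2.split()), key=len)
    return [min(levenshtein(w, v) for v in long_) for w in short]

print(min_pair_distance('kosher salt', 'salt'))
d = best_match_distances('lime wedges', 'fresh lime juice')
print(d, mean(d), median(d))
```

Which aggregate to use (second smallest, median, mean) depends on how strongly a single shared word should pull two sentences together. If such a score is negated and passed to `AffinityPropagation`, the matrix would need a signed dtype (the `np.uint8` in the question cannot hold negative values) and probably `affinity='precomputed'`, since the default treats each row as a feature vector.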

Answer 2 (score: 0)

You are using Levenshtein distance on characters.

To turn "salt example" into "salt somethingelse", you have to delete most of the string character by character and then add the remainder back.

In other words, your distance function does not capture what you want.
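The effect is easy to check on entries from the question's own list. The sketch below uses a plain dynamic-programming edit distance as a stand-in for `Levenshtein.distance` (an assumption made so the block runs without extra packages).

```python
def levenshtein(a, b):
    """Character-level edit distance (stand-in for Levenshtein.distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

for other in ('milk', 'kosher salt'):
    print('salt ->', other, ':', levenshtein('salt', other))
# salt -> milk : 3
# salt -> kosher salt : 7
```

On characters, 'salt' is closer to 'milk' (three substitutions) than to 'kosher salt' (seven deletions), which is the opposite of the grouping the question asks for.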

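You should first figure out an appropriate distance measure. This could be, for example, the longest common substring ("salt A" and "salt B" share a substring of length 5), or you could treat all words independently and use a normalized Levenshtein distance averaged over all words, and so on.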
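As a rough sketch of the longest-common-substring idea, the standard library's `difflib.SequenceMatcher` can find the longest matching block of two strings. The helper names and the normalization by the shorter string's length are assumptions chosen for illustration, not something the answer prescribes.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def lcs_similarity(a, b):
    """Longest common substring length divided by the shorter string's
    length, giving a value between 0 and 1 (assumed normalization)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shorter

print(repr(longest_common_substring('salt A', 'salt B')))  # 'salt '
print(lcs_similarity('kosher salt', 'salt'))               # 1.0
print(lcs_similarity('fresh lime juice', 'lime'))          # 1.0
```

Under this measure 'salt' and 'kosher salt' are maximally similar, as are 'lime' and 'fresh lime juice', because the shorter entry is fully contained in the longer one. Whether that is the right bias for the whole ingredient list is a judgment call.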

Definitely spend more time on the distance function.