Question

I have a list of strings:

['twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe', '"beware', 'the', 'jabberwock', 'my', 'son', 'the', 'jaws', 'that', 'bite', 'the', 'claws', 'that', 'catch', 'beware', 'the', 'jubjub', 'bird', 'and', 'shun', 'the', 'frumious', 'bandersnatch', 'he', 'took', 'his', 'vorpal', 'sword', 'in', 'hand', 'long', 'time', 'the', 'manxome', 'foe', 'he', 'sought', 'so', 'rested', 'he', 'by', 'the', 'tumtum', 'tree', 'and', 'stood', 'awhile', 'in', 'thought', 'and', 'as', 'in', 'uffish', 'thought', 'he', 'stood', 'the', 'jabberwock', 'with', 'eyes', 'of', 'flame', 'came', 'whiffling', 'through', 'the', 'tulgey', 'wood', 'and', 'burbled', 'as', 'it', 'came', 'one', 'two', 'one', 'two', 'and', 'through', 'and', 'through', 'the', 'vorpal', 'blade', 'went', 'snicker-snack', 'he', 'left', 'it', 'dead', 'and', 'with', 'its', 'head', 'he', 'went', 'galumphing', 'back', '"and', 'has', 'thou', 'slain', 'the', 'jabberwock', 'come', 'to', 'my', 'arms', 'my', 'beamish', 'boy', 'o', 'frabjous', 'day', 'callooh', 'callay', 'he', 'chortled', 'in', 'his', 'joy', '`twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe']

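How can I return the list of word(s) that are most different from the other words in the list, based on the minimum similarity, together with the average similarity value (as a float) to all other words in the list?

I have absolutely no idea how to do this. I think I need to use the cossim(word1, word2) function, which calculates the similarity between 'word1' and 'word2' and which our lecturer has given us, but I don't know how to use it.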

import math
from collections import defaultdict

def cossim(word1, word2):
    """Calculate the cosine similarity between the two words"""

    # sub-function for constructing a letter vector from argument `word`
    # which returns the tuple `(vec,veclen)`, where `vec` is a dictionary of
    # characters in `word`, and `veclen` is the length of the vector
    def wordvec(word):
        vec = defaultdict(int)  # letter vector

        # count the letters in the word
        for char in word:
            vec[char] += 1

        # calculate the length of the letter vector
        # (named `veclen` so the built-in `len` is not shadowed)
        veclen = 0.0
        for char in vec:
            veclen += vec[char]**2

        # return the letter vector and vector length
        return vec, math.sqrt(veclen)

    # calculate a vector,length tuple for each of `word1` and `word2`
    vec1, len1 = wordvec(word1)
    vec2, len2 = wordvec(word2)

    # calculate the dot product between the letter vectors for the two words
    dotprod = 0.0
    for char in vec1:
        dotprod += vec1[char]*vec2[char]

    # divide by the lengths of the two vectors
    if dotprod:
        dotprod /= len1*len2

    return dotprod

For the list above, I should get this answer:

(['my'], 0.088487238234566931)

Any help is much appreciated,

Thanks,

基利

Answer 1

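Before using the method Robert Rossney suggests, you need to remove the duplicates from the word list. Otherwise the resulting numbers will be slightly off, because the same word w may appear more than once in a given d[word].

One possible way is to create a set from the list: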

set_of_words = set(mylist)
differences = {}
for word in set_of_words:
    differences[word] = [cossim(word, word2) for word2 in set_of_words if word != word2]

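This creates a dictionary that maps each word to a list of its similarity values against all the other words (cossim returns similarities, so a lower value means a bigger difference).

Instead of assigning these lists directly to the dictionary entries, you could also keep each one in a variable inside the loop and use that variable to compute the average suggested in Robert's solution.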

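The dictionary method iteritems (items in Python 3) lets you iterate over the (key, value) pairs, and the min function takes a special argument, key, that specifies what to minimize, for example a key function that selects the second element of a tuple or list.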
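As a rough illustration of that last point, here is a minimal sketch that picks the entry with the smallest value from a dictionary. The numbers in `averages` are made up for the example, and `operator.itemgetter(1)` is just one possible key function; a lambda would work equally well.

```python
from operator import itemgetter

# Hypothetical average similarities, one per word (made-up numbers).
averages = {"wabe": 0.31, "my": 0.09, "the": 0.27}

# items() yields (key, value) pairs (iteritems() in Python 2);
# itemgetter(1) tells min to compare the pairs by their value.
word, lowest = min(averages.items(), key=itemgetter(1))
print(word, lowest)
```

In the real problem, `averages` would be filled from the `differences` dictionary rather than typed in by hand.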

Answer 2

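As a starting point, you might want to build a dictionary whose keys are the words in the list and whose values are all the other words in the list: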

d = {}
for word in mylist:
   d[word] = [w for w in mylist if w != word]

This lets you quickly compute the similarity values for each word:

similarities = {}
for word in mylist:
   similarities[word] = [cossim(w, word) for w in d[word]]

From this it is easy to compute the minimum and average similarity for each word.
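That last step could look roughly like the sketch below. It assumes Python 3, de-duplicates the words first, and uses a compact `cossim` written here only so the example runs on its own (it is meant to behave like the letter-count cosine from the question, not to replace it). The helper name `similarity_stats` and the sample list are made up for illustration.

```python
import math
from collections import Counter

def cossim(word1, word2):
    """Compact stand-in for the question's cossim (letter-count cosine)."""
    vec1, vec2 = Counter(word1), Counter(word2)
    dot = sum(count * vec2[char] for char, count in vec1.items())
    len1 = math.sqrt(sum(c * c for c in vec1.values()))
    len2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (len1 * len2) if dot else 0.0

def similarity_stats(words):
    """Map each distinct word to (minimum, average) similarity to the others.

    Assumes at least two distinct words, otherwise there is nothing to compare.
    """
    unique = sorted(set(words))
    stats = {}
    for word in unique:
        sims = [cossim(word, other) for other in unique if other != word]
        stats[word] = (min(sims), sum(sims) / len(sims))
    return stats

sample = ["the", "then", "my", "the"]
stats = similarity_stats(sample)

# The word whose average similarity to the others is lowest.
oddest = min(stats, key=lambda w: stats[w][1])
print(oddest, stats[oddest])
```

Whether to rank by the minimum, the average, or both is up to how the assignment defines "most different"; the tuple keeps both values around so either can be used.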

Answer 3

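So, if I understand correctly, the goal is to find the word with the smallest total cossim against all the other words. For that, the following code suffices: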

/* removed at the reasonable request of agf */

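From a high-level perspective, what we are doing is looping over every word in the list and checking how similar it is to all the other words. If it is less similar than any word we have seen so far, we store it. Our output is then the word with the lowest similarity to all the other words.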
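The loop described above can be sketched as follows. This is not the code that was removed from this answer, only one way to write the same idea in Python 3. The similarity function is passed in as `sim`, and the `jaccard` helper is a toy stand-in so the example runs without the lecturer's cossim; both names are made up here.

```python
def least_similar(words, sim):
    """Return (word, total) for the word with the lowest summed similarity."""
    best_word, best_total = None, float("inf")
    for word in words:
        total = sum(sim(word, other) for other in words if other != word)
        if total < best_total:  # less similar than anything seen so far
            best_word, best_total = word, total
    return best_word, best_total

def jaccard(a, b):
    """Toy stand-in for cossim: shared letters over all distinct letters."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

result = least_similar(["the", "then", "my", "toves"], jaccard)
print(result)
```

To use it on the real data, pass `cossim` as `sim`. Note that this version keeps only the first word that reaches the lowest total, so ties would need extra handling.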

Answer 4

I think the Python-Levenshtein module (available on PyPI) may help with getting the similarity of word1 and word2.

Two of its functions are used:

import Levenshtein

str1 = 'abcde'
str2 = 'abcdf'
print(Levenshtein.distance(str1,str2))
# 1
print(Levenshtein.ratio(str1,str2))
# 0.8

That is enough.

From a list of strings, how do you get the strangest word/string in Python
