Question

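I am trying to write a function that takes a list of words (strings), counts how many times each distinct word occurs, and returns a dictionary mapping each word to the number of times it appears in the list divided by the total number of words in the list (a term frequency vector).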

def makeTermFrequencyVector(wordList):
    '''
    makeTermFrequencyVector Takes a list of words as parameter and returns a dictionary representing the term frequency
    vector of the word list, where words are keys and values are the frequency of occurrence of
    each word in the document.
    '''
    tfDict = {}
    for word in wordList:
        for i in range(len(wordList)):
            state = 0
            if wordList[i] == word:
                state += 1
        tfv = state / (len(wordList))
        tfDict[word] = tfv
    return tfDict

If I call:

makeTermFrequencyVector(['cat', 'dog'])

the output should be:

{'cat': 0.5, 'dog': 0.5}

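because each word occurs once in a list of total length 2.

However, this code returns a dictionary in which only the last word of the input list has the correct tf value; every other word has a value of 0. So if I pass the list above to my current code, it returns: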

{'dog': 0.5, 'cat': 0.0}

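which is incorrect.

How can I fix this so that the value is computed correctly for every word in the list, not just the last one? I would like the fixed code to stay as close as possible to my current code.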

Answer 1

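This is simpler if we make separate passes instead of nested ones. In the first pass, we count the words. In the second pass, we replace the counts with frequencies: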

def makeTermFrequencyVector(wordList):
    '''
    Takes a list of words and returns a dictionary representing
    the term frequency vector of the word list, where words are
    keys and values are the frequency of occurrence.
    '''

    tfDict = dict()

    for word in wordList:
        tfDict[word] = tfDict.get(word, 0) + 1

    word_count = len(wordList)

    for word in tfDict:
        tfDict[word] /= word_count

    return tfDict

print(makeTermFrequencyVector(['cat', 'dog']))

word_list = [ \
    'Takes', 'a', 'list', 'of', 'words', 'as', 'its', 'sole', 'parameter', \
    'and', 'returns', 'a', 'dictionary', 'representing', 'the', 'term', \
    'frequency', 'vector', 'of', 'the', 'word', 'list,', 'where', 'words', \
    'are', 'keys', 'and', 'values', 'are', 'the', 'frequency', 'of', \
    'occurrence', 'of', 'each', 'word', 'in', 'the', 'source', 'document', \
]

print(makeTermFrequencyVector(word_list))

Output

> python3 test.py
{'cat': 0.5, 'dog': 0.5}
{'Takes': 0.025, 'a': 0.05, 'list': 0.025, 'of': 0.1, 'words': 0.05, 'as': 0.025, 'its': 0.025, 'sole': 0.025, 'parameter': 0.025, 'and': 0.05, 'returns': 0.025, 'dictionary': 0.025, 'representing': 0.025, 'the': 0.1, 'term': 0.025, 'frequency': 0.05, 'vector': 0.025, 'word': 0.05, 'list,': 0.025, 'where': 0.025, 'are': 0.05, 'keys': 0.025, 'values': 0.025, 'occurrence': 0.025, 'each': 0.025, 'in': 0.025, 'source': 0.025, 'document': 0.025}
>

Answer 2

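cdlane's two-pass approach is the way to go over nested for loops. The reason is that each pass takes O(n) time, where n is the length of the list. Two passes take O(n) + O(n) = O(2n) time, and dropping the constant factor gives an O(n) asymptotic running time. The nested loops, by contrast, compare every word against every other word, which takes O(n²) time.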
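As a rough illustration of the same two-pass idea (not taken from either answer), the counting pass can be handed to collections.Counter from the standard library. The function name term_frequency_counter is made up for this sketch:

```python
from collections import Counter


def term_frequency_counter(word_list):
    """Hypothetical helper: two linear passes, same result as the loop version."""
    # First pass: count every word in a single O(n) scan of the list.
    counts = Counter(word_list)
    total = len(word_list)
    # Second pass: one division per distinct word.
    return {word: count / total for word, count in counts.items()}


print(term_frequency_counter(['cat', 'dog', 'cat', 'bird']))
```

For an empty list this returns an empty dictionary, because the second pass has nothing to divide.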

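Part of the reason your code does not work is that state = 0 sits inside the inner loop, so state is reset to 0 on every iteration of that loop instead of accumulating. If you take the line state = 0 and move it out of the inner for loop, to just before it but still inside the outer loop, I think the logic should work.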
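Here is a minimal sketch of that change, assuming Python 3 (where / is true division) and keeping everything else in the question's code as it was. Only the position of state = 0 differs:

```python
def makeTermFrequencyVector(wordList):
    '''Map each word in wordList to its share of the total word count.'''
    tfDict = {}
    for word in wordList:
        # Reset the counter once per word, before the inner loop starts.
        state = 0
        for i in range(len(wordList)):
            if wordList[i] == word:
                state += 1
        # The inner loop has finished, so state now holds the full count.
        tfv = state / len(wordList)
        tfDict[word] = tfv
    return tfDict


print(makeTermFrequencyVector(['cat', 'dog']))
```

This prints {'cat': 0.5, 'dog': 0.5}. It still runs in O(n²) time, so it is mainly useful when staying close to the original code matters more than speed.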

Making a dictionary of term frequency values
