Question

Code:

def add_lexical_features(fdist, feature_vector):
    for word, freq in fdist.items():
        fname = "unigram:{0}".format(word)

        if selected_features == None or fname in selected_features:
            feature_vector[fname] = 1

        if selected_features == None or fname in selected_features:
            feature_vector[fname] = float(freq) / fdist.N()
            print(feature_vector)

if __name__ == '__main__':
    file_name = "restaurant-training.data"
    p = process_reviews(file_name)
    for i in range(0, len(p)):
        print(p[i] + "\n")
        uni_dist = nltk.FreqDist(p[0])
        feature_vector = {}
        x = add_lexical_features(uni_dist, feature_vector)

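What this is trying to do is output the word frequencies for a list of reviews (p is the list of reviews, and p[0] is a string). It works... except that it counts by letter, not by word.

I'm still new to NLTK, so this may be obvious, but I really can't figure it out.

For example, it prints a lot of output like this:

{'unigram:n': 0.0783132530120482}

That's fine, and I think it's the right number (the number of times n appears divided by the total number of letters), but I want it by word, not by letter.

I'd also like to do the same with bigrams. Once I get it working for single words, bigrams are probably easy, but I can't quite see how, so some guidance there would be nice too.

Thanks.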

Answer 1

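The input to nltk.FreqDist should be a list of strings, not a single string. A string is iterated character by character, so each letter is counted as its own item. See the difference: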

>>> import nltk
>>> uni_dist = nltk.FreqDist(['the', 'dog', 'went', 'to', 'the', 'park'])
>>> uni_dist
FreqDist({'the': 2, 'went': 1, 'park': 1, 'dog': 1, 'to': 1})
>>> uni_dist2 = nltk.FreqDist('the dog went to the park')
>>> uni_dist2
FreqDist({' ': 5, 't': 4, 'e': 3, 'h': 2, 'o': 2, 'a': 1, 'd': 1, 'g': 1, 'k': 1, 'n': 1, ...})

You can use split to turn the string into a list of individual words.
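As a minimal sketch of that suggestion, the snippet below splits one review on whitespace before counting. The `review` string is a made-up example, and note that plain `split()` leaves punctuation attached to words, so a real tokenizer may be a better fit for actual review text.

```python
import nltk

# Hypothetical example review; in the question this would be one element of p.
review = "the dog went to the park"

# str.split() with no arguments splits on runs of whitespace,
# producing a list of word tokens instead of a sequence of characters.
words = review.split()
uni_dist = nltk.FreqDist(words)

print(uni_dist["the"])  # how many times the word "the" occurs
print(uni_dist.N())     # total number of tokens counted
```

With a list as input, the keys of the distribution are whole words, so `fdist.N()` in the question's function becomes the total word count rather than the total character count.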

Side note: I think you probably want to call nltk.FreqDist on p[i] rather than p[0].
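Putting both points together, here is one possible sketch of the loop that builds a separate feature vector per review, and that also covers the bigram case from the question using `nltk.bigrams`. The `reviews` list is a hypothetical stand-in for the output of `process_reviews`, the `prefix` parameter is an addition not present in the original function, and the `selected_features` filter is left out for brevity.

```python
import nltk

def add_lexical_features(fdist, feature_vector, prefix="unigram"):
    """Store each item's relative frequency under '<prefix>:<item>'."""
    total = fdist.N()
    for item, freq in fdist.items():
        # Bigrams arrive as tuples of words; join them for a readable key.
        label = " ".join(item) if isinstance(item, tuple) else item
        feature_vector["{0}:{1}".format(prefix, label)] = freq / total
    return feature_vector

# Hypothetical stand-in for the list returned by process_reviews().
reviews = [
    "the food was good",
    "the service was slow and the food was cold",
]

vectors = []
for review in reviews:          # use each review, not always the first one
    words = review.split()      # list of words, not a raw string
    feature_vector = {}
    add_lexical_features(nltk.FreqDist(words), feature_vector)
    add_lexical_features(nltk.FreqDist(nltk.bigrams(words)),
                         feature_vector, prefix="bigram")
    vectors.append(feature_vector)

print(vectors[0])
```

Here each bigram frequency is divided by the number of bigrams in that review, which is one reasonable normalization but not the only one.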
