Question

I am trying to do text classification with Python and NLTK on text strings that are typically only 10-20 words long.

I want to compute word frequencies and n-grams of size 2-4, somehow convert them into vectors, and use those vectors to build an SVM model.

I suspect there is a fairly standard NLTK way to do all of this, but I can't find it.

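I imagine the standard way would already be smart about things like stemming (so that "important" and "importance" are treated as the same word) and discarding punctuation and very common English words, and that it might implement a clever way of turning these counts into vectors for me. I am new to both text classification and Python, and I am open to any suggestions on either!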

Answer 1

OK, this is my first attempt at answering a Stack Overflow question...

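Your question is a bit vague, so I'll do my best to answer it. It sounds like you are asking how to prepare text before building an SVM model, specifically how to lemmatize the text input, compute word frequencies, and create n-grams from a given string.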

import nltk
from collections import Counter
from nltk import ngrams
from nltk.stem import WordNetLemmatizer


# lowercase, remove punctuation, and lemmatize string
# (word_tokenize and WordNetLemmatizer need NLTK data packages,
# e.g. nltk.download('punkt') and nltk.download('wordnet'))
def word_generator(text):
    wnl = WordNetLemmatizer()
    for word in nltk.word_tokenize(text):
        # isalpha() drops punctuation and numeric tokens
        if word.isalpha():
            yield wnl.lemmatize(word.lower())


# create list of (word, count) pairs, most frequent first
def freq_count(text):
    return Counter(word_generator(text)).most_common()


# create n-grams of size n as a list of tuples
def make_ngrams(text, n):
    return list(ngrams(list(word_generator(text)), n))

Example output, including 4-grams:

>>> my_str = 'This is this string, not A great Strings not the greatest string'

>>> print(freq_count(my_str))
[('string', 3), ('this', 2), ('not', 2), ('is', 1), ('a', 1), ('great', 1), ('the', 1), ('greatest', 1)]

>>> print(make_ngrams(my_str, 4))
[('this', 'is', 'this', 'string'), ('is', 'this', 'string', 'not'), ('this', 'string', 'not', 'a'), ('string', 'not', 'a', 'great'), ('not', 'a', 'great', 'string'), ('a', 'great', 'string', 'not'), ('great', 'string', 'not', 'the'), ('string', 'not', 'the', 'greatest'), ('not', 'the', 'greatest', 'string')]

Then you can do whatever you like with the results, such as building vectors.
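As a rough illustration of that last step, here is a minimal sketch of turning word and n-gram counts into fixed-length count vectors. It is not part of the original answer: the names `tokenize`, `extract_features`, `build_vocabulary` and `vectorize` are hypothetical, and `tokenize` is a simplified regex stand-in for `word_generator` (no lemmatization) so that the sketch runs without NLTK data.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for word_generator (an assumption for this sketch):
    # lowercase alphabetic tokens only, no lemmatization.
    return re.findall(r"[a-z]+", text.lower())


def extract_features(text, min_n=1, max_n=4):
    # Count every n-gram of size min_n..max_n; size 1 gives plain word frequencies.
    tokens = tokenize(text)
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_vocabulary(texts):
    # Map each distinct n-gram seen in the training texts to a fixed column index.
    vocab = {}
    for text in texts:
        for gram in extract_features(text):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(text, vocab):
    # Dense count vector; n-grams that are not in the vocabulary are ignored.
    vec = [0] * len(vocab)
    for gram, count in extract_features(text).items():
        idx = vocab.get(gram)
        if idx is not None:
            vec[idx] = count
    return vec


train_texts = ["great string", "not a great string"]
vocab = build_vocabulary(train_texts)
vectors = [vectorize(t, vocab) for t in train_texts]
```

Each row in `vectors` has one column per n-gram in the vocabulary, so the rows can be handed to whatever SVM implementation you choose, together with your class labels. If scikit-learn is an option, its `CountVectorizer` (with the `ngram_range` parameter) and `LinearSVC` cover the same steps.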

How to convert n-grams and word frequencies in a string into vectors to build an SVM model
