Can't count tokens after tokenizing, removing stop words, and stemming

Time: 2017-06-04 16:43:10

Tags: python string nltk preprocessor

I have the following function:

def preprocessText (data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(string.punctuation))
        filtered = [word for word in tokens if word not in nltk.corpus.stopwords.words('english')]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    print(Counter(tokens).most_common(10))
    return (np.array(preprocessed))

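It is supposed to remove punctuation, tokenize, remove stop words, and stem with the Porter Stemmer. However, it does not work correctly. For example, when I run this code: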

s = ["The cow and of.", "and of dog the."]
print (Counter(preprocessText(s)))

It produces this output:

[('and', 1), ('.', 1), ('dog', 1), ('the', 1), ('of', 1)]

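Neither the punctuation nor the stop words are removed.

2 Answers:

Answer 0: (score: 2)

Your translate call does not remove the punctuation. Here is some working code. I made a few changes, the most important of which are: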

Code:

xlate = {ord(x): y for x, y in
         zip(string.punctuation, ' ' * len(string.punctuation))}
tokens = nltk.word_tokenize(each.lower().translate(xlate))

Test code:

from collections import Counter
import nltk
import numpy as np
import string

stopwords = set(nltk.corpus.stopwords.words('english'))
try:
    # python 2
    xlate = string.maketrans(
        string.punctuation, ' ' * len(string.punctuation))
except AttributeError:
    xlate = {ord(x): y for x, y in
             zip(string.punctuation, ' ' * len(string.punctuation))}

def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(xlate))
        filtered = [word for word in tokens if word not in stopwords]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    return np.array(preprocessed)

s = ["The cow and of.", "and of dog the."]
print(Counter(sum([list(x) for x in preprocessText(s)], [])))
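To look at the translation-table step on its own, here is a minimal sketch that assumes Python 3 and leaves NLTK out entirely. `strip_punctuation` is a hypothetical helper name, not part of the code above. It uses `str.maketrans`, which builds the same kind of ordinal-keyed table as the dict comprehension.

```python
import string

# Map every punctuation character to a space (Python 3 only).
XLATE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(text):
    """Lower-case the text and replace each punctuation character with a space."""
    return text.lower().translate(XLATE)

cleaned = strip_punctuation("The cow and of.")
print(cleaned.split())  # ['the', 'cow', 'and', 'of']
```

Replacing punctuation with spaces rather than deleting it keeps words such as "a-b" from being glued together before tokenizing.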

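Answer 1: (score: 0)

The problem is that you are misusing translate. To use it correctly, you need to create a mapping table that (as the help string will tell you) maps "Unicode ordinals to Unicode ordinals, strings, or None". For example, like this: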

>>> mapping = dict((ord(x), None) for x in string.punctuation)  # `None` means "delete"
>>> print("This.and.that".translate(mapping))
Thisandthat
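To see why the call in the question left the punctuation in place, this small sketch (assuming Python 3) compares passing `string.punctuation` directly with passing a proper mapping. Given a plain string, `translate` indexes it by each character's code point. `string.punctuation` is only 32 characters long, so the lookup for ordinary letters and punctuation marks fails and those characters are left unchanged.

```python
import string

text = "This.and.that"

# Passing the punctuation string itself: every code point here is >= 32,
# so the lookup fails and each character is kept as-is.
unchanged = text.translate(string.punctuation)

# Passing a dict keyed by code point: a value of None deletes the character.
mapping = {ord(ch): None for ch in string.punctuation}
deleted = text.translate(mapping)

print(unchanged)  # This.and.that
print(deleted)    # Thisandthat
```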

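However, if you apply this to word tokens, you will just replace the punctuation tokens with empty strings. You could add a step to get rid of them, but I suggest you simply select what you do want: namely, alphanumeric words.

tokens = [t for t in nltk.word_tokenize(each.lower()) if t.isalnum()]

You need to change your code to match.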
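Put together, the selection approach might look like the sketch below. To stay runnable without NLTK's downloaded data it makes two loud assumptions: a small regular expression stands in for `nltk.word_tokenize`, and `STOPWORDS` is a tiny hand-written set rather than NLTK's English list. `simple_tokenize` and `keep_words` are hypothetical names, and stemming is left out.

```python
import re
from collections import Counter

# Assumption: tiny stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "and", "of"}

def simple_tokenize(text):
    """Stand-in for nltk.word_tokenize: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def keep_words(text):
    """Keep lower-cased alphanumeric tokens that are not stop words."""
    return [t for t in simple_tokenize(text.lower())
            if t.isalnum() and t not in STOPWORDS]

sentences = ["The cow and of.", "and of dog the."]
# Count over all sentences, after filtering, not just the last token list.
counts = Counter(t for s in sentences for t in keep_words(s))
print(counts.most_common(10))  # [('cow', 1), ('dog', 1)]
```

Counting over the filtered tokens of every sentence matters here: the function in the question counts the unfiltered `tokens` of the last sentence only, which is why the stop words and the period show up in its output.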