How to remove nonsensical or incomplete words from a corpus?

Time: 2018-07-09 02:56:52

Tags: python machine-learning nlp deep-learning data-cleaning

I am using some text for NLP analysis. I have cleaned the text, taking steps to remove non-alphanumeric characters, whitespace, duplicate words, and stopwords, and have also performed stemming and lemmatization:

from nltk.tokenize import word_tokenize
import nltk.corpus
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd

data_df = pd.read_csv('path/to/file/data.csv')

stopwords = nltk.corpus.stopwords.words('english') 

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Function to remove duplicates from sentence
def unique_list(l):
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

for i in range(len(data_df)):

    # Convert to lower case, split into individual words using word_tokenize
    sentence = word_tokenize(data_df['O_Q1A'][i].lower())

    # Remove stopwords
    filtered_sentence = [w for w in sentence if not w in stopwords]

    # Remove duplicate words from sentence
    filtered_sentence = unique_list(filtered_sentence)

    # Remove non-letters
    junk_free_sentence = []
    for word in filtered_sentence:
        junk_free_sentence.append(re.sub(r"[^\w\s]", " ", word)) # Strip punctuation and other non-word characters, but keep whitespace for now
        #junk_free_sentence.append(re.sub(r"[^a-z]+", " ", word)) # Alternative: keep only lowercase letters

    # Stem the junk free sentence
    stemmed_sentence = []
    for w in junk_free_sentence:
        stemmed_sentence.append(stemmer.stem(w))

    # Lemmatize the stemmed sentence
    lemmatized_sentence = []
    for w in stemmed_sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(w))

    data_df.loc[i, 'O_Q1A'] = ' '.join(lemmatized_sentence) # .loc avoids pandas chained-assignment warnings
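For reference, the same per-row cleaning can be expressed with `DataFrame.apply`. A minimal, self-contained sketch, simplified to plain `str.split` and a small hand-picked stop set so it runs without any NLTK downloads:

```python
import pandas as pd

# Hand-picked stop set used here as a stand-in for NLTK's stopword list
stop = {'the', 'a', 'is'}

def clean(text):
    """Lowercase, tokenize by whitespace, drop stopwords, de-duplicate in order."""
    seen = []
    for w in text.lower().split():
        if w not in stop and w not in seen:
            seen.append(w)
    return ' '.join(seen)

data_df = pd.DataFrame({'O_Q1A': ['The work is the work', 'A thank a thank']})
data_df['O_Q1A'] = data_df['O_Q1A'].apply(clean)
print(data_df['O_Q1A'].tolist())  # ['work', 'thank']
```

`apply` keeps the cleaning logic in one testable function and avoids the index-based assignment entirely.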

But when I display the top 10 words (ranked by some criterion), I still get some junk, such as:

ask
much
thank
work
le
know
via
sdh
n
sy
t
n t
recommend
never

Of the top 10 words, only 5 are sensible (ask, know, recommend, thank, work). What else do I need to do to keep only the meaningful words?

1 Answer:

Answer 0 (score: 0)

The default NLTK stopword list is minimal, and it certainly does not contain words like 'ask' and 'much', because they are not generally meaningless. These words are only irrelevant to your analysis, not to everyone else's. For your problem, you can always apply a custom stopword filter after the NLTK one. A simple example:

from nltk.corpus import stopwords

def removeStopWords(text):
    # Select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # Add custom words that are meaningless for this analysis
    cachedStopWords.update(('ask', 'much', 'thank', 'etc.'))
    # Remove stop words
    new_str = ' '.join(word for word in text.split() if word not in cachedStopWords)
    return new_str
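A self-contained illustration of the same idea, with a small hand-picked stop set standing in for NLTK's list (the words are hypothetical, chosen so the snippet runs without downloading corpora):

```python
# Hypothetical custom stop list: a few generic words plus domain words
# ('ask', 'much', 'thank') that are meaningless for this particular analysis.
custom_stopwords = {'the', 'a', 'is', 'ask', 'much', 'thank'}

def remove_stop_words(text):
    """Drop every whitespace-separated token found in the custom stop list."""
    return ' '.join(w for w in text.split() if w not in custom_stopwords)

print(remove_stop_words('ask much thank work recommend'))  # work recommend
```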

Alternatively, you can edit the NLTK stopword list itself, which is essentially a text file stored in the NLTK data directory.
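Beyond stop lists, another option (a suggestion not in the original answer) is to keep only tokens that appear in an English vocabulary such as `nltk.corpus.words`, which would discard fragments like 'sdh' and 'sy'. A sketch with a tiny stand-in vocabulary; in practice you would build the set from `nltk.corpus.words.words()` after `nltk.download('words')`:

```python
# Tiny stand-in vocabulary; replace with set(nltk.corpus.words.words())
english_vocab = {'ask', 'know', 'recommend', 'thank', 'work', 'never', 'via'}

def keep_known_words(tokens):
    """Keep only tokens that occur in the vocabulary."""
    return [t for t in tokens if t in english_vocab]

print(keep_known_words(['ask', 'sdh', 'n', 'sy', 't', 'recommend']))  # ['ask', 'recommend']
```

Note that stems produced by SnowballStemmer are often not dictionary words, so a vocabulary filter like this belongs before the stemming step, not after it.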