Remove common words from a list of strings with Python NLTK

Date: 2021-06-02 17:32:47

Tags: python pandas nltk

I am trying to remove a list of common words from a set of strings (text) in a Python pandas dataframe. The dataframe columns look like this:

 ['Item', 'Label', 'Comment']

I have already removed the stop words, but after building a word cloud I can see a few more frequent words that I would like to remove to get a better picture of the problem.

This is my current working code. It works fairly well, but not well enough:

# This receives one sentence at a time.
# Use a loop (or DataFrame.apply) to process a whole dataset.
def nlp_preprocess(text, stopwords, lemmatizer, wordnet_map):
    # Remove HTML-like tags first (before the punctuation pass strips the brackets)
    text = re.sub(r"</?.*?>", " ", text)
    # Remove punctuation, digits, and special characters, keeping letters only
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)
    # Remove stop words such as "and", "is", "a", "the"
    text = " ".join([word for word in text.split() if word not in stopwords])
    # Lemmatize each word using its POS tag (defaulting to noun)
    pos_tagged_text = nltk.pos_tag(text.split())
    text = " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
                     for word, pos in pos_tagged_text])
    return text
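To illustrate what the cleanup regexes do, here is a minimal standard-library sketch on a made-up input string (the sample text is an assumption, not from the question):

```python
import re

raw = "<p>Pump #3 failed after 45 cycles!</p>"
step1 = re.sub(r"</?.*?>", " ", raw)       # strip HTML-like tags
step2 = re.sub(r"[^a-zA-Z]", " ", step1)   # drop punctuation and digits
step3 = re.sub(r"\s+", " ", step2).strip() # collapse whitespace
print(step3)  # Pump failed after cycles
```

Note that the tag pass has to run before the punctuation pass; once `<` and `>` are replaced, the tag pattern can no longer match.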

def full_nlp_text_process(df, pandas_params, stopwords, lemmatizer, wordnet_map):
    data = preprocess_dataframe(df, pandas_params)
    nlp_data = data.copy()
    nlp_data["ProComment"] = nlp_data['Comment'].apply(
        lambda x: nlp_preprocess(x, stopwords, lemmatizer, wordnet_map))
    return data, nlp_data

I know I want something like the snippet below, but I don't know how to fit it in to remove those words, nor where it should go (i.e. in the text processing or in the dataframe processing):

fdist2 = nltk.FreqDist(text)
most_list = fdist2.most_common(10)  # list of (word, count) tuples
# Somewhere else
common = {word for word, count in most_list}
text = [t for t in text if t not in common]
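The idea above can be sketched with just the standard library. Note that `most_common` returns `(word, count)` tuples, so the words have to be extracted before membership testing (the sample sentence is invented for illustration; `collections.Counter` stands in for `nltk.FreqDist`, which has the same `most_common` interface):

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept".split()
fdist = Counter(text)
most_list = fdist.most_common(2)              # [('the', 3), ('cat', 2)]
common = {word for word, count in most_list}  # keep only the words
filtered = [t for t in text if t not in common]
print(filtered)  # ['sat', 'on', 'mat', 'and', 'slept']
```

Building a new list also avoids the classic bug of calling `remove` on a list while iterating over it.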

1 answer:

Answer 0 (score: 0)

I came up with a new approach, and it works well. The first method is this:

def find_common_words(df):
    # Concatenate every processed comment into one string
    full_text = " ".join(df["ProComment"])
    allWords = nltk.tokenize.word_tokenize(full_text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)
    mostCommon = allWordDist.most_common(10)
    # Keep only the words, dropping the counts
    return [word for word, count in mostCommon]

Then you will need this method to finish the job:

def remove_common_words(df, common_words):
    for index, row in df.iterrows():
        word_tokens = word_tokenize(row["ProComment"])
        filtered_sentence = " ".join(w for w in word_tokens
                                     if w not in common_words)
        # Assign through the DataFrame, not the row copy, so the change sticks
        df.at[index, "ProComment"] = filtered_sentence

This way you take in a dataframe and do all of the processing through it.
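Putting the two steps together, here is a minimal self-contained sketch of the same idea. The three-row dataframe is invented, and `collections.Counter` with plain `str.split` stands in for `nltk.FreqDist` and `word_tokenize` so the snippet runs without any NLTK downloads:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({"ProComment": [
    "pump fail during test run",
    "pump leak found during inspection",
    "sensor fault during pump start",
]})

# Step 1: count word frequencies across the whole column
counts = Counter(" ".join(df["ProComment"]).split())
common_words = {w for w, _ in counts.most_common(2)}

# Step 2: rebuild each comment without the common words
df["ProComment"] = df["ProComment"].apply(
    lambda s: " ".join(w for w in s.split() if w not in common_words))
print(df["ProComment"].tolist())
```

With real data you would swap the tokenizer back to `word_tokenize` and raise `most_common(2)` to 10, as in the answer above.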