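Removing gibberish words in Python

Time: 2018-10-12 21:38:48

Tags: python machine-learning nlp nltk

I want to remove gibberish (non-English words) from my dataset.

I tried something like this, which I saw on StackOverflow: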

import nltk
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha())

But now that I have a dataframe, how do I apply this to an entire column?

I tried something like this:

import nltk
words = set(nltk.corpus.words.words())

sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w:" ".join(w for w in 
nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha()))

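But I get an error: TypeError: expected string or bytes-like object

1 answer:

Answer 0: (score: 0)

Something like the following will create a column Clean by applying your function to each value in the column Chats: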

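import nltk

# English vocabulary from the NLTK "words" corpus (must already be downloaded)
words = set(nltk.corpus.words.words())

def clean_sent(sent):
    # Keep tokens that are known words or are not purely alphabetic
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())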

df['Clean'] = df['Chats'].apply(clean_sent)
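
The TypeError in the question comes from the lambda ignoring its own argument: it tokenizes sent, which is the whole column, instead of the single cell passed in. Below is a minimal sketch of the difference using only the standard library. VOCAB and tokenize are hypothetical stand-ins for the NLTK word list and tokenizer, and a plain list stands in for the Chats column; none of these names come from the original post.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.words.words()).
VOCAB = {"to", "the", "beach", "with", "my"}

def tokenize(text):
    # Split into runs of word characters or runs of punctuation,
    # standing in for a word/punctuation tokenizer.
    return re.findall(r"\w+|[^\w\s]+", text)

def clean_sent(sent):
    # Keep known words and any token that is not purely alphabetic.
    return " ".join(w for w in tokenize(sent)
                    if w.lower() in VOCAB or not w.isalpha())

# A plain list standing in for the column of chat strings.
chats = ["Io andiamo to the beach with my amico.", "to the beach!"]

# Buggy pattern: the whole collection is handed to the tokenizer.
try:
    tokenize(chats)
    error = None
except TypeError as exc:
    error = exc  # a regex cannot scan a list (or a pandas Series)

# Correct pattern: the function receives one string at a time,
# which is what Series.apply does with a one-argument function.
cleaned = [clean_sent(s) for s in chats]
```

Passing a named one-argument function to apply, as in the answer, avoids the problem because each call only sees one cell.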

To update the Chats column itself, overwrite it with the result of applying the function to the original column:

df['Chats'] = df['Chats'].apply(clean_sent)
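
If the column can contain missing values or other non-string entries (an assumption; the question does not say), a function like clean_sent would raise the same TypeError on those cells. A hedged sketch of a guard follows. clean_sent_safe and VOCAB are hypothetical names, and a regex stands in for the NLTK tokenizer so the example runs without any downloads.

```python
import re

# Hypothetical small vocabulary; the real one would come from NLTK.
VOCAB = {"hello", "good", "day", "is"}

def clean_sent_safe(value, vocab=VOCAB):
    # Leave non-strings (NaN, None, numbers) untouched instead of
    # passing them to the tokenizer.
    if not isinstance(value, str):
        return value
    tokens = re.findall(r"\w+|[^\w\s]+", value)
    # Keep known words and any token that is not purely alphabetic.
    return " ".join(t for t in tokens
                    if t.lower() in vocab or not t.isalpha())

# A plain list standing in for a column with missing entries.
rows = ["hello wrld, good day", float("nan"), None, "42 is good"]
result = [clean_sent_safe(v) for v in rows]
```

With pandas, the same function could be passed to apply in place of clean_sent, for example df['Chats'].apply(clean_sent_safe).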