I want to remove gibberish from my dataset.
I tried something like this, which I saw on StackOverflow:
import nltk
words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha())
But now that I have a DataFrame, how do I apply this to an entire column?
I tried something like this:
import nltk
words = set(nltk.corpus.words.words())
sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w: " ".join(w for w in
                                nltk.wordpunct_tokenize(sent) \
                                if w.lower() in words or not w.isalpha()))
But I get the error `TypeError: expected string or bytes-like object`.
Answer 0 (score: 0)
The TypeError happens because your lambda tokenizes `sent`, which is the whole Series (`df['Chats']`), instead of the single row value passed in as `w`. Something like the following will generate a column Clean by applying your function to the column Chats:
words = set(nltk.corpus.words.words())
def clean_sent(sent):
return " ".join(w for w in nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha())
df['Clean'] = df['Chats'].apply(clean_sent)
To update the Chats column itself, overwrite it with the result:
df['Chats'] = df['Chats'].apply(clean_sent)
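A minimal runnable sketch of the same pattern, using a small hand-rolled word set and a regex tokenizer as stand-ins so it runs without downloading the NLTK `words` corpus (in practice you would use `set(nltk.corpus.words.words())` and `nltk.wordpunct_tokenize` as in the answer):

```python
import re
import pandas as pd

# Stand-in for set(nltk.corpus.words.words()); the real corpus
# requires nltk.download('words').
words = {"io", "to", "the", "beach", "with", "my"}

def clean_sent(sent):
    # Rough stand-in for nltk.wordpunct_tokenize: split into
    # alphanumeric runs and punctuation runs.
    tokens = re.findall(r"\w+|[^\w\s]+", sent)
    # Keep tokens that are known words or non-alphabetic (punctuation, numbers).
    return " ".join(w for w in tokens if w.lower() in words or not w.isalpha())

df = pd.DataFrame({"Chats": ["Io andiamo to the beach with my amico."]})
df["Clean"] = df["Chats"].apply(clean_sent)
print(df["Clean"][0])  # "Io to the beach with my ."
```

Note that `apply` hands `clean_sent` one cell (a string) at a time, which is why the function must take a single sentence, not the whole Series.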