How to remove stop words from amazon_baby.csv in Python

Date: 2018-06-09 13:02:46

Tags: python-3.x scikit-learn nltk

I want to remove the stop words and punctuation from amazon_baby.csv.

import pandas as pd

# Load the reviews and replace missing values with empty strings
data = pd.read_csv('amazon_baby.csv')
data.fillna(value='', inplace=True)
data.head()

(preview of amazon_baby.csv)

import string
from nltk.corpus import stopwords

def text_process(msg):
    # Strip punctuation character by character
    no_punc = [char for char in msg if char not in string.punctuation]
    no_punc = ''.join(no_punc)

    # Keep only words that are not English stop words; note that
    # stopwords.words('english') is re-read for every single word
    return [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]

data['review'].apply(text_process)

This code runs on up to about 10k rows, but when it is applied to the whole dataset the kernel just shows busy and the cell never finishes executing.

Please help.

Find the dataset here.

2 Answers:

Answer 0 (score: 1)

You are processing the data character by character, which is very slow.

Since the dataset is large (~183,531 rows) and every row has to be processed individually, the complexity becomes O(n²). I implemented a slightly different approach using word_tokenize, shown below:

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_punctuation_and_stopwords(msg):
    stop_words = set(stopwords.words('english'))
    # Tokenize the message once, then filter out stop words and punctuation tokens
    word_tokens = word_tokenize(msg)
    filtered_words = [w for w in word_tokens
                      if w.lower() not in stop_words and w not in string.punctuation]
    return ' '.join(filtered_words)

I ran it for 6 minutes and it processed 136,322 rows. I'm sure that if I had let it run for 10 minutes it would have finished successfully.
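One further speedup, sketched here as a suggestion rather than part of the original answer: build the stop-word set once at module level, so stopwords.words('english') is not re-evaluated for every row, and then apply the function to the review column from the question. The clean_review function and the cleaned_review column name are purely illustrative.

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once; rebuilding it per row (or per word) is the main bottleneck
STOP_WORDS = set(stopwords.words('english'))

def clean_review(msg):
    # Keep only tokens that are neither stop words nor punctuation
    return ' '.join(w for w in word_tokenize(msg)
                    if w.lower() not in STOP_WORDS and w not in string.punctuation)

# 'cleaned_review' is an illustrative column name, not from the question
data['cleaned_review'] = data['review'].apply(clean_review)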

Answer 1 (score: -1)

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_clean(msg):
    # Lowercase all tokens, then drop punctuation and stop words
    tokens = word_tokenize(msg)
    tokens = [w.lower() for w in tokens]
    stop_words = set(stopwords.words('english'))
    no_punc_and_stop_words = [w for w in tokens
                              if w not in string.punctuation and w not in stop_words]

    return no_punc_and_stop_words
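Unlike the first answer, this returns a list of tokens rather than a joined string. As in the question, it can then be applied to the review column (a minimal usage sketch):

data['review'].apply(text_clean)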