Remove stopwords from a pandas column, then count the number of removed words in each row

Time: 2019-11-13 10:27:31

Tags: python regex pandas

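I am using bbc-text.csv, which can be downloaded here: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv

I will assume you have already downloaded the file, so you can reproduce what follows.

The structure of the file is very simple: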

import pandas as pd
bbc = pd.read_csv('bbc-text.csv')

The stopwords are as follows:

stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

To remove the stopwords, I use the following code, which I found in a Stack Exchange post:

pat = r'\b(?:{})\b'.format('|'.join(stopwords))

bbc['text_clean'] = bbc['text'].str.replace(pat, '', regex=True)

I don't quite understand why ?: is needed in the regular expression right after the opening parenthesis.
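To illustrate what the ?: changes, here is a minimal sketch using only the standard re module. The two-word list and the sample string are hypothetical stand-ins for the full stopword list. The parentheses are what keep the \b anchors applied to every alternative; the ?: only tells the engine not to store the match as a numbered group.

```python
import re

words = ["a", "the"]  # hypothetical short subset of the stopword list
joined = '|'.join(words)

non_capturing = r'\b(?:{})\b'.format(joined)  # the pattern used in the question
capturing = r'\b({})\b'.format(joined)        # same grouping, but stores group 1
no_group = r'\b{}\b'.format(joined)           # parses as (\ba) OR (the\b)

sample = "a bath at the theatre"

# For a plain removal, both grouped patterns give the same result.
with_non_capturing = re.sub(non_capturing, '', sample)
with_capturing = re.sub(capturing, '', sample)

# Without any group, the anchors bind to single alternatives,
# so the "a" at the start of "at" is removed as well.
without_group = re.sub(no_group, '', sample)

# The difference between the grouped forms shows up in functions that
# report groups: re.split includes captured text in its output.
split_non_capturing = re.split(non_capturing, "x a y")
split_capturing = re.split(capturing, "x a y")
```

For a plain replacement with an empty string, either grouped form should behave the same; the non-capturing form simply avoids creating a group that is never used.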

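I tried to collapse double, triple, etc. spaces using the following regex pattern, but it also removed the single space that should have remained between words, which was not my intention.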

pat = r'\s(\s+)'
bbc['text_clean'] = bbc['text_clean'].str.replace(pat, '', regex=True)

On the other hand, the following code performs the task correctly:

pat = r'\s\s+'
bbc['text_clean'] = bbc['text_clean'].str.replace(pat, ' ', regex=True)

Can you tell me why the first snippet fails while the second one succeeds?
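Comparing the two snippets side by side, the patterns match the same text (a run of two or more whitespace characters); what differs is the replacement string. A minimal sketch with the standard re module, using a hypothetical sample sentence, suggests that replacing the run with an empty string glues the neighbouring words together, while replacing it with a single space keeps them apart:

```python
import re

# Hypothetical sample: removing "in" and "the" leaves a run of three spaces.
original = "tv future in the hands"
stripped = re.sub(r'\b(?:in|the)\b', '', original)

# First snippet: the whole whitespace run is replaced by nothing.
first = re.sub(r'\s(\s+)', '', stripped)

# Second snippet: the whole whitespace run is replaced by one space.
second = re.sub(r'\s\s+', ' ', stripped)

# The first pattern with a single-space replacement behaves like the second,
# so the capturing group itself is not what causes the problem.
first_pattern_with_space = re.sub(r'\s(\s+)', ' ', stripped)
```

Under this reading, the capturing group in the first pattern is harmless but unused; only the replacement argument needs to change.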

I have two approaches, posted as a screenshot in the original question, that both find how many words were removed from each row.
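The two approaches mentioned above are not reproduced in the text, so the following is only a minimal sketch of how removed words could be counted per row, not necessarily what the post used. The shortened stopword list and the two-row DataFrame are hypothetical stand-ins for the real data.

```python
import pandas as pd

stopwords = ["a", "the", "of", "for"]  # hypothetical short subset
pat = r'\b(?:{})\b'.format('|'.join(stopwords))

# Hypothetical stand-in for the 'text' column of bbc-text.csv.
bbc = pd.DataFrame({"text": ["the future of tv", "a win for the team"]})
bbc["text_clean"] = bbc["text"].str.replace(pat, "", regex=True)

# Option 1: count how many times the stopword pattern matches in each row.
removed_by_match = bbc["text"].str.count(pat)

# Option 2: compare whitespace-separated token counts before and after.
tokens_before = bbc["text"].str.split().str.len()
tokens_after = bbc["text_clean"].str.split().str.len()
removed_by_split = tokens_before - tokens_after
```

The two counts agree on this sample, but they may diverge on contractions such as "he'd", where the pattern can match only the "he" part and leave a fragment behind that still counts as a token.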

However, the third one raises an exception. Can you tell me why?

bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-40-16f1495620d3> in <module>
----> 1 bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6926             kwds=kwds,
   6927         )
-> 6928         return op.get_result()
   6929 
   6930     def applymap(self, func):

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    290 
    291         # compute the result using the series generator
--> 292         self.apply_series_generator()
    293 
    294         # wrap results

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    319             try:
    320                 for i, v in enumerate(series_gen):
--> 321                     results[i] = self.f(v)
    322                     keys.append(v.name)
    323             except Exception as e:

<ipython-input-40-16f1495620d3> in <lambda>(x)
----> 1 bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   1069         key = com.apply_if_callable(key, self)
   1070         try:
-> 1071             result = self.index.get_value(self, key)
   1072 
   1073             if not is_scalar(result):

~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4728         k = self._convert_scalar_indexer(k, kind="getitem")
   4729         try:
-> 4730             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4731         except KeyError as e1:
   4732             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: ('text', 'occurred at index category')
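The message "occurred at index category" suggests the lambda was called with the category column rather than with a row. DataFrame.apply defaults to axis=0, which passes each column as a Series, so x['text'] is looked up among the row labels and fails. A minimal sketch, assuming a hypothetical two-row stand-in for the data with hand-written cleaned text, passes axis=1 so that the lambda receives one row at a time:

```python
import pandas as pd

# Hypothetical stand-in; column names are taken from the code and traceback.
bbc = pd.DataFrame({
    "category": ["tech", "sport"],
    "text": ["the future of tv", "a win for the team"],
    "text_clean": ["future tv", "win team"],
})

# Default axis=0: the lambda receives whole columns, so 'text' is not a valid label.
try:
    bbc.apply(lambda x: len(x["text"]) - len(x["text_clean"]))
    raised_key_error = False
except KeyError:
    raised_key_error = True

# axis=1: the lambda receives one row at a time, indexed by column name.
removed_chars = bbc.apply(
    lambda row: len(row["text"]) - len(row["text_clean"]), axis=1
)
```

Note that len on a string counts characters, so this difference is the number of removed characters, not the number of removed words.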

0 answers:

No answers