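I am using bbc-text.csv, which can be downloaded here: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv
I will assume you have already downloaded the file, so you can reproduce what follows.
Reading the file is very simple: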
import pandas as pd
bbc = pd.read_csv('bbc-text.csv')
The stopwords are as follows:
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
To remove the stopwords, I use the following code, which I found in a Stack Exchange post:
pat = r'\b(?:{})\b'.format('|'.join(stopwords))
bbc['text_clean'] = bbc['text'].str.replace(pat, '', regex=True)
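To make the effect of this replacement concrete, here is a minimal sketch on a made-up one-row Series. The three-word list and the sample sentence are assumptions for illustration only, not taken from bbc-text.csv.

```python
import pandas as pd

# Hypothetical short list standing in for the full stopwords list above.
sample_stopwords = ["the", "on", "a"]
sample_pat = r'\b(?:{})\b'.format('|'.join(sample_stopwords))

# Made-up sentence, not a row from the BBC data.
sample = pd.Series(["the cat sat on a mat"])

# regex=True asks pandas to treat the pattern as a regular expression.
cleaned = sample.str.replace(sample_pat, '', regex=True)

# Only the words are removed; the spaces around them stay behind.
print(repr(cleaned.iloc[0]))
```

Because each stopword is replaced by an empty string, the surrounding spaces survive, which is why runs of two or more spaces show up in the cleaned text.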
I don't quite understand why ?: is needed in the regular expression right after the opening parenthesis.
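For reference, a small sketch of what ?: changes, using Python's re module on a made-up sentence. The word list and sample string are assumptions for illustration. As far as the re documentation goes, (?:...) groups the alternatives without storing a numbered capture group.

```python
import re

# Hypothetical three-word list standing in for the full stopwords list.
words = ["the", "a", "of"]
sample = "the cat of a king"

capturing = r'\b({})\b'.format('|'.join(words))
non_capturing = r'\b(?:{})\b'.format('|'.join(words))
ungrouped = r'\b{}\b'.format('|'.join(words))  # \bthe|a|of\b

# Both grouped patterns match exactly the same substrings.
matches_cap = [m.group(0) for m in re.finditer(capturing, sample)]
matches_non = [m.group(0) for m in re.finditer(non_capturing, sample)]

# The only difference is whether a numbered group is recorded.
groups_cap = re.compile(capturing).groups
groups_non = re.compile(non_capturing).groups

# Without any parentheses, \b binds only to the first and last
# alternative, so a bare "a" also matches inside "cat".
matches_ungrouped = re.findall(ungrouped, sample)
```

In this sketch the parentheses are what keep the word boundaries applied to every alternative; the ?: only avoids creating a capture group that the replacement never uses.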
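I tried to remove the double, triple, etc. spaces using the following regex pattern, but it also removes the single spaces, which is not my intention.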
pat = r'\s(\s+)'
bbc['text_clean'] = bbc['text_clean'].str.replace(pat, '', regex=True)
On the other hand, the following code does the job correctly:
pat = r'\s\s+'
bbc['text_clean'] = bbc['text_clean'].str.replace(pat, ' ', regex=True)
Can you tell me why the first code snippet fails and the second one succeeds?
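One way to compare the two snippets side by side is to run both patterns on a short made-up string with re.sub. The sample string is an assumption for illustration. In this sketch the two patterns match the same runs of whitespace, and the visible difference comes from the replacement string ('' versus ' ').

```python
import re

# Made-up string: two spaces, then three spaces, then a single space.
sample = "tv  future   in hands"

# First snippet: each run of 2+ whitespace characters becomes ''.
first = re.sub(r'\s(\s+)', '', sample)

# Second snippet: each run of 2+ whitespace characters becomes ' '.
second = re.sub(r'\s\s+', ' ', sample)

# First pattern again, but with ' ' as the replacement.
first_with_space = re.sub(r'\s(\s+)', ' ', sample)

print(repr(first))
print(repr(second))
print(repr(first_with_space))
```

Here the single space between "in" and "hands" is untouched by both patterns, while the first snippet glues together the words that were separated by a longer run.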
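Both of the following approaches find how many words were removed from each row:
However, the third approach raises an exception. Can you tell me why?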
bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-40-16f1495620d3> in <module>
----> 1 bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-40-16f1495620d3> in <lambda>(x)
----> 1 bbc.apply(lambda x: (len(x['text']) - len(x['text_clean'])))
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
1069 key = com.apply_if_callable(key, self)
1070 try:
-> 1071 result = self.index.get_value(self, key)
1072
1073 if not is_scalar(result):
~\Anaconda3\envs\tf2\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: ('text', 'occurred at index category')
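The "occurred at index category" part of the message suggests that the lambda received the category column rather than a row, which is what DataFrame.apply does with its default axis=0. A minimal sketch on a made-up two-row frame follows. The frame and its contents are assumptions for illustration; only the column names mirror the ones above.

```python
import pandas as pd

# Hypothetical frame with the same column names as bbc.
df = pd.DataFrame({
    'category': ['tech', 'sport'],
    'text': ['the future of tv', 'a win for the team'],
    'text_clean': ['future tv', 'win team'],
})

# Default axis=0: the lambda receives each column as a Series,
# so looking up the label 'text' in it raises KeyError.
try:
    df.apply(lambda x: len(x['text']) - len(x['text_clean']))
    raised = False
except KeyError:
    raised = True

# axis=1: the lambda receives each row, so x['text'] is one string.
removed_chars = df.apply(
    lambda x: len(x['text']) - len(x['text_clean']), axis=1
)

# Column-wise equivalent that avoids apply altogether.
removed_chars_vec = df['text'].str.len() - df['text_clean'].str.len()
```

Note that len() on a string counts characters, so these differences are numbers of characters removed; counting words would need something like .str.split().str.len() on each column instead.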