This is the function I use to remove punctuation from a column in pandas.
import re

def remove_punctuation(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)
This is how I apply it:
review_without_punctuation = products['review'].apply(remove_punctuation)
Here, products is a pandas DataFrame.
This is the error message I get:
TypeError Traceback (most recent call last)
<ipython-input-19-196c188dfb67> in <module>()
----> 1 review_without_punctuation = products['review'].apply(remove_punctuation)
/Users/username/Dropbox/workspace/private/pydev/ml/classification/.env/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()
<ipython-input-18-0950dc65d8b8> in remove_punctuation(text)
1 def remove_punctuation(text):
----> 2 return re.sub(r'[^\w\s]','',text)
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
189 a callable, it's passed the match object and must return
190 a replacement string to be used."""
--> 191 return _compile(pattern, flags).sub(repl, string, count)
192
193 def subn(pattern, repl, string, count=0, flags=0):
TypeError: expected string or bytes-like object
What am I doing wrong?
Answer 0 (score: 0)
You should always try to avoid running pure Python code through apply() in pandas. It is slow. Instead, use the special str accessor that exists on every pandas string Series:
In [9]: s = pd.Series(['hello', 'a,b,c', 'hmm...'])
In [10]: s.str.replace(r'[^\w\s]', '', regex=True)
Out[10]:
0 hello
1 abc
2 hmm
dtype: object
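The traceback in the question shows re.sub receiving something that is not a string. A common cause, though the question does not confirm it, is missing values (NaN) in the column. The following is a minimal sketch under that assumption, using made-up data, showing that the str accessor leaves missing values alone instead of raising:

```python
import pandas as pd

# Hypothetical data: a string column containing one missing value.
s = pd.Series(['hello', 'a,b,c', None, 'hmm...'])

# regex=True is passed explicitly because newer pandas versions
# treat the pattern as a literal string by default.
cleaned = s.str.replace(r'[^\w\s]', '', regex=True)

# The missing entry stays missing; the other entries lose their punctuation.
print(cleaned.isna().tolist())      # [False, False, True, False]
print(cleaned.dropna().tolist())    # ['hello', 'abc', 'hmm']
```

If the missing reviews should become empty strings instead, calling fillna('') on the column before cleaning is one option.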
Answer 1 (score: 0)
It doesn't work because your apply is being applied incorrectly. The correct way is:
import re
import pandas as pd

s = pd.Series(['hello', 'a,b,c', 'hmm...'])
s.apply(lambda x: re.sub(r'[^\w\s]', '', x))
0 hello
1 abc
2 hmm
dtype: object
(Credit to @John Zwinck for the regex.)
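Note that re.sub raises "TypeError: expected string or bytes-like object" whenever it is handed a value that is not a string, for example a NaN float. Here is a rough sketch of an apply-based version that guards against that. Returning non-string values unchanged is my assumption about the desired behavior, and the sample data is made up:

```python
import re
import pandas as pd

def remove_punctuation(text):
    # Assumption: anything that is not a string (e.g. NaN or None)
    # is passed through untouched rather than cleaned.
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical data with one missing value.
s = pd.Series(['hello', 'a,b,c', None, 'hmm...'])
cleaned = s.apply(remove_punctuation)

print(cleaned.dropna().tolist())    # ['hello', 'abc', 'hmm']
```

Checking products['review'].map(type).value_counts() on the real data would show whether non-string values are actually present.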
Comparing this with the other solution:
%timeit s.apply(lambda x: re.sub(r'[^\w\s]', '',x))
%timeit s.str.replace(r'[^\w\s]', '', regex=True)
1000 loops, best of 3: 275 µs per loop
1000 loops, best of 3: 310 µs per loop
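A three-element Series is too small to say much about speed either way. Below is a sketch of how the comparison could be repeated on a larger Series outside IPython, using the standard timeit module. The Series size, the repeat count, and the helper names with_apply and with_str_replace are arbitrary choices of mine, and the timings will vary by machine and pandas version:

```python
import re
import timeit
import pandas as pd

# Hypothetical larger input: the same three strings repeated many times.
big = pd.Series(['hello', 'a,b,c', 'hmm...'] * 10_000)

def with_apply():
    return big.apply(lambda x: re.sub(r'[^\w\s]', '', x))

def with_str_replace():
    return big.str.replace(r'[^\w\s]', '', regex=True)

# Both approaches should produce the same cleaned values.
assert with_apply().tolist() == with_str_replace().tolist()

repeats = 5
for fn in (with_apply, with_str_replace):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats:.4f} s per call")
```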