Question

This is the function I use to remove punctuation from a column in pandas.

def remove_punctuation(text):
    return re.sub(r'[^\w\s]','',text)

This is how I apply it.

review_without_punctuation = products['review'].apply(remove_punctuation)

Here, products is a pandas DataFrame.

This is the error message I get.

TypeError                                 Traceback (most recent call last)
<ipython-input-19-196c188dfb67> in <module>()
----> 1 review_without_punctuation = products['review'].apply(remove_punctuation)

/Users/username/Dropbox/workspace/private/pydev/ml/classification/.env/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295 
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

<ipython-input-18-0950dc65d8b8> in remove_punctuation(text)
      1 def remove_punctuation(text):
----> 2     return re.sub(r'[^\w\s]','',text)

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

What am I doing wrong?

Answer 1

You should always try to avoid running pure Python code through apply() in Pandas; it is slow. Instead, use the special str property that exists on every Pandas string Series:

In [9]: s = pd.Series(['hello', 'a,b,c', 'hmm...'])
In [10]: s.str.replace(r'[^\w\s]', '')
Out[10]: 
0    hello
1      abc
2      hmm
dtype: object
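Applied to the question's DataFrame, the same idea can be sketched as follows. Two things here are assumptions rather than facts from the thread: the sample data is made up, and the guess that the "expected string or bytes-like object" error comes from non-string entries such as NaN in the review column (the question does not show the data). In recent pandas releases (2.0 and later, as far as I know) str.replace treats the pattern as a literal string unless regex=True is passed, so the flag is spelled out.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the real 'review' column is not shown in the question.
products = pd.DataFrame(
    {"review": ["Great product!", np.nan, "So-so... works, I guess?"]}
)

# regex=True states explicitly that the pattern is a regular expression.
# The .str accessor skips missing values, so NaN stays NaN instead of raising.
review_without_punctuation = products["review"].str.replace(
    r"[^\w\s]", "", regex=True
)

print(review_without_punctuation)
```

If the column really does hold missing values, they come through untouched here, which is one way the .str accessor differs from calling re.sub on each element directly.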

Answer 2

It doesn't work because your apply is applied incorrectly.

The correct way to do it is:

import re
import pandas as pd
s = pd.Series(['hello', 'a,b,c', 'hmm...'])
s.apply(lambda x: re.sub(r'[^\w\s]', '', x))
0    hello
1      abc
2      hmm
dtype: object

(Hat tip to @John Zwinck for the regex.)
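If the apply() route is kept, the TypeError in the question suggests that re.sub was handed something that is not a string. A minimal sketch of a guarded version follows. It assumes, without confirmation from the thread, that the column contains missing values stored as float NaN; the reviews Series is made-up illustration data.

```python
import re

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Return non-string values (e.g. NaN or numbers) unchanged
    # instead of passing them to re.sub, which would raise TypeError.
    if not isinstance(text, str):
        return text
    return re.sub(r"[^\w\s]", "", text)


# Hypothetical data standing in for products['review'].
reviews = pd.Series(["hello", "a,b,c", np.nan, "hmm..."])

# Quick diagnostic: how many entries are not strings?
n_non_strings = sum(not isinstance(v, str) for v in reviews)
print(n_non_strings)

cleaned = reviews.apply(remove_punctuation)
print(cleaned)
```

Counting the non-string entries first is a cheap way to check whether this assumption about the data actually holds before changing the function.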

Comparing the apply() version with the other solution (str.replace):

%timeit s.apply(lambda x: re.sub(r'[^\w\s]', '',x))
%timeit s.str.replace(r'[^\w\s]', '')
1000 loops, best of 3: 275 µs per loop
1000 loops, best of 3: 310 µs per loop
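The timings above were taken on a three-element Series, where per-call overhead likely dominates, so they may not say much about larger columns. A rough sketch for repeating the comparison on a bigger input is shown below. The input size and the helper names with_apply and with_str_replace are arbitrary choices for illustration, and no particular outcome is claimed.

```python
import re
import timeit

import pandas as pd

# Hypothetical larger input: the three sample strings repeated many times.
s = pd.Series(["hello", "a,b,c", "hmm..."] * 10_000)


def with_apply():
    # Pure-Python regex substitution, called once per element.
    return s.apply(lambda x: re.sub(r"[^\w\s]", "", x))


def with_str_replace():
    # The pandas string accessor with an explicit regex flag.
    return s.str.replace(r"[^\w\s]", "", regex=True)


# Sanity check: both approaches should yield the same cleaned strings.
assert with_apply().tolist() == with_str_replace().tolist()

for fn in (with_apply, with_str_replace):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Results will vary with the pandas version, the data, and the machine, so it is worth measuring on a sample of the real column.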

Trying to remove punctuation from a column in Pandas
