Question

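I have a DataFrame named "data", and I want to replace every punctuation character in a given column with nothing (in other words, remove them).

I'm using Python 3 with pandas and NumPy to preprocess the text before feeding it to a neural network.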

symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
dataClean = data['description']

for i in symbols:
    dataClean = np.char.replace(dataClean,i,"")

I expected punctuation to be removed from every string in every row of dataClean (items 0 through 2549). Instead, I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-87-aa944ae6e61c> in <module>
      3 
      4 for i in symbols:
----> 5     dataClean = np.char.replace(dataClean,i,"")
      6 
      7 print(dataClean[2])

~\Anaconda3\lib\site-packages\numpy\core\defchararray.py in replace(a, old, new, count)
   1184     return _to_string_or_unicode_array(
   1185         _vec_string(
-> 1186             a, object_, 'replace', [old, new] + _clean_args(count)))
   1187 
   1188 

TypeError: string operation on non-string array

Answer 1

If dataClean is a pandas Series of strings, you can use the Series.str.translate method:

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"  # backslash written as \\ to avoid an invalid-escape warning
dataClean = data['description']
dataClean = dataClean.str.translate({ord(symbol):"" for symbol in symbols})

For example, suppose we have a DataFrame df:

In [59]: df = pd.DataFrame({'data':['[Yes?]', '(No!)', 100]}); df
Out[59]: 
     data
0  [Yes?]
1   (No!)
2     100

We can then build a dict that maps Unicode ordinals to strings (here, the empty string):

In [52]: symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
In [57]: {ord(symbol):"" for symbol in symbols}
Out[57]: 
{33: '',
 34: '',
 ...
 126: '',
 10: ''}

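Each Unicode ordinal, or code point, corresponds to one Unicode character, and a Python 3 string is a sequence of Unicode characters. For each string in the Series, the translate method replaces every character whose ordinal is a key in the dict with the corresponding value; characters that are not in the dict are left unchanged. Mapping an ordinal to the empty string therefore deletes that character.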
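The same mapping can be tried on an ordinary Python string, which is roughly what Series.str.translate does for each element. This is a minimal sketch; the names table, table_alt, and sample are only illustrative, and str.maketrans is shown as an alternative way to build an equivalent table.

```python
# Plain-Python illustration of the per-string translation step.
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# Dict form: code point -> replacement string (an empty string deletes the character).
table = {ord(symbol): "" for symbol in symbols}

# Alternative: str.maketrans with three arguments; the third lists characters to delete.
table_alt = str.maketrans("", "", symbols)

sample = "[Yes?] (No!)\n"  # illustrative input, not from the original data
cleaned = sample.translate(table)
cleaned_alt = sample.translate(table_alt)

print(repr(cleaned))  # 'Yes No' (the space is kept because it is not in symbols)
```

Either table can be passed to Series.str.translate in the same way.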

In [60]: df['data'].str.translate({ord(symbol):"" for symbol in symbols})
Out[60]: 
0    Yes
1     No
2    NaN
Name: data, dtype: object

Note that translate maps non-string values, such as the 100 in the third row, to NaN.
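If the non-string values should be kept rather than turned into NaN, one possible approach is to cast the column to str before translating. This is a sketch under that assumption; the name cleaned is illustrative. Be aware that casting also stringifies missing values (depending on the pandas version, NaN may become the literal text 'nan'), so it may not suit every dataset.

```python
import pandas as pd

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
table = {ord(symbol): "" for symbol in symbols}

df = pd.DataFrame({'data': ['[Yes?]', '(No!)', 100]})

# Cast every value to str first, so the integer 100 becomes the string '100'
# and survives the .str accessor instead of being mapped to NaN.
cleaned = df['data'].astype(str).str.translate(table)

print(cleaned.tolist())  # ['Yes', 'No', '100']
```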

Answer 2

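You can use a regular-expression character class:

# regex=True is required on pandas versions where str.replace defaults to literal matching;
# the hyphen is escaped so that "+-." is not read as a range (which would also strip commas)
symbols = "[!\"#$%&()*+\\-./:;<=>?@[\\]^_`{|}~\n]"
dataClean = dataClean.str.replace(symbols, "", regex=True)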
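Hand-writing a character class is easy to get subtly wrong, because characters such as "-", "]", "^", and "\" have special meaning inside the brackets. As a hedged alternative, the class can be built from the plain symbol list with re.escape. The sample Series and the names pattern and result below are hypothetical and only serve the illustration.

```python
import re
import pandas as pd

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# re.escape backslash-escapes the regex metacharacters, so every symbol is matched literally.
pattern = "[" + re.escape(symbols) + "]"

# Hypothetical sample data standing in for data['description'].
dataClean = pd.Series(['[Yes?]', 'a-b, c', 'line\nbreak'])

# regex=True makes pandas interpret the pattern as a regular expression.
result = dataClean.str.replace(pattern, "", regex=True)

print(result.tolist())  # ['Yes', 'ab, c', 'linebreak']
```

The comma in the second item is kept because it is not part of symbols.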
