Question

Given the following DataFrame df:

   RID                  Other_aided Ultibro Relvar
0  701              {_12,_101,_102}    {_9}    NaN
1  702                 {_7,_11,_16}    {_7}    NaN
2  703  {_12,_101,_102,_10,_11,_16}    {_7}    NaN
3  704                  {_5,_3,_16}     NaN    NaN
4  705       {_101,_102,_10,_3,_16}    {_6}    NaN

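I want to clean df as follows:

Remove the characters {, } and _ from the data columns.
Replace NaN with the empty string ''.
Leave the first column, the integer ID column (RID), untouched.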

I wrote the following function f:

import re
f = lambda x: re.sub(r'[^0-9,]','', x)

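Running it:

df.Other_aided.apply(f) works fine on a single column in which every value is a string.
df.Ultibro.apply(f) and df.Relvar.apply(f) fail with TypeError: expected string or bytes-like object, because of the NaN values.
So I thought converting the data columns to strings would help: df.iloc[:, 1:].apply(lambda y: f(str(y)), axis=1). Unfortunately this produces incorrect output, such as: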

0         175,9,10,3,11,1612,101,102810109918280,
1                    159,10,37,11,16710710717281,
...

How can I clean df?

Answer 1
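
The two failures in the question appear to have separate causes, which the sketch below tries to reproduce. The two-row sample frame is an assumption trimmed from the question's data; f is the question's own function. NaN is a float, so re.sub refuses it. With axis=1 the lambda receives a whole row as a Series, so str(y) is the printed form of the entire row, including its "Name: ..., dtype: ..." footer, and stray digits and commas from that text survive the regex.

```python
import re

import numpy as np
import pandas as pd

f = lambda x: re.sub(r'[^0-9,]', '', x)

# Two rows trimmed from the question's frame are enough to show the effect.
df = pd.DataFrame({
    'RID': [701, 702],
    'Other_aided': ['{_12,_101,_102}', '{_7,_11,_16}'],
    'Ultibro': ['{_9}', '{_7}'],
    'Relvar': [np.nan, np.nan],
})

# 1) NaN is a float, not a string, so re.sub() raises TypeError.
try:
    f(np.nan)
except TypeError as exc:
    print('TypeError:', exc)

# 2) With axis=1 each call gets one row as a Series. str() renders the
#    whole row (labels, values and footer), not a single cell.
row_text = str(df.iloc[0, 1:])
print(row_text)

# The regex then keeps every digit and comma found anywhere in that text.
print(f(row_text))
```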

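If you want to keep using your function, first replace NaN with empty strings, then pass the function to DataFrame.applymap for element-wise processing:

import re
f = lambda x: re.sub(r'[^0-9,]','', x)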
df.iloc[:, 1:] = df.iloc[:, 1:].fillna('').applymap(f)
print (df)
   RID          Other_aided Ultibro Relvar
0  701           12,101,102       9       
1  702              7,11,16       7       
2  703  12,101,102,10,11,16       7       
3  704               5,3,16               
4  705      101,102,10,3,16       6
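
As a variation on the element-wise approach, the function itself can tolerate non-string cells, so no fillna step is needed. This is only a sketch: clean_cell is a hypothetical helper name, the two-row frame is a trimmed sample, and Series.map is used per column instead of applymap, partly because newer pandas releases (2.1 and later, as far as I know) deprecate DataFrame.applymap in favor of DataFrame.map.

```python
import re

import numpy as np
import pandas as pd


def clean_cell(value):
    """Keep digits and commas in strings; map anything else (e.g. NaN) to ''."""
    if isinstance(value, str):
        return re.sub(r'[^0-9,]', '', value)
    return ''


# Trimmed sample of the question's data (assumption for illustration).
df = pd.DataFrame({
    'RID': [701, 704],
    'Other_aided': ['{_12,_101,_102}', '{_5,_3,_16}'],
    'Ultibro': ['{_9}', np.nan],
    'Relvar': [np.nan, np.nan],
})

# Every column except the first one (RID) is a data column.
data_cols = df.columns[1:]

# Series.map calls clean_cell once per cell, one column at a time.
df[data_cols] = df[data_cols].apply(lambda col: col.map(clean_cell))
print(df)
```

Because RID is never passed through clean_cell, its integer values are left as they are.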

Alternatively, use DataFrame.replace with a regular expression:

df.iloc[:, 1:] = df.iloc[:, 1:].fillna('').replace(r'[^0-9,]','', regex=True)
print (df)
   RID          Other_aided Ultibro Relvar
0  701           12,101,102       9       
1  702              7,11,16       7       
2  703  12,101,102,10,11,16       7       
3  704               5,3,16               
4  705      101,102,10,3,16       6

# If the first column never has missing values, there is nothing in it to
# replace with empty strings, so the whole DataFrame can be processed at once.
df = df.fillna('').replace(r'[^0-9,]','', regex=True)
print (df)
   RID          Other_aided Ultibro Relvar
0  701           12,101,102       9       
1  702              7,11,16       7       
2  703  12,101,102,10,11,16       7       
3  704               5,3,16               
4  705      101,102,10,3,16       6
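
To check that the whole-frame variant really leaves RID alone, here is a self-contained reproduction. The frame is rebuilt by hand from the table in the question, which is an assumption about how the original data was stored (strings plus NaN). Regex replacement only acts on string values, so the integer column should pass through unchanged.

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the question's table (assumed storage: str and NaN).
df = pd.DataFrame({
    'RID': [701, 702, 703, 704, 705],
    'Other_aided': [
        '{_12,_101,_102}',
        '{_7,_11,_16}',
        '{_12,_101,_102,_10,_11,_16}',
        '{_5,_3,_16}',
        '{_101,_102,_10,_3,_16}',
    ],
    'Ultibro': ['{_9}', '{_7}', '{_7}', np.nan, '{_6}'],
    'Relvar': [np.nan] * 5,
})

# NaN -> '' first, then drop every character that is not a digit or comma.
cleaned = df.fillna('').replace(r'[^0-9,]', '', regex=True)

print(cleaned)
print(cleaned.dtypes)  # RID is expected to remain an integer column
```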
