Add NaN where the regex does not match

Time: 2019-08-30 00:51:41

Tags: regex python-3.x string pandas nan

import pandas as pd
df = pd.DataFrame({'Date': ['nothing ',
                            'This 1A1619 A124 person BL171111 the A-1-24 and ',
                            'dont Z112 but NOT 12-24-1981',
                            'nada here either',
                            'mix: 1A25629Q88 or A13B ok A1 the A16'],
                   'IDs': ['A11', 'B22', 'C33', 'D44', 'E55'],
                   })

This is a follow-up to the question "pulling mixed letters and numbers". Using this code

pat = r'((?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S))'
df['Date'].str.extractall(pat)

gives me

        0
   match    
1   0   1A1619
    1   A124
    2   BL171111
2   0   Z112
4   0   1A25629Q88
    1   A13B
    2   A1
    3   A16

I would like to add NaN for the rows where the regex does not match. So I want something like this

        0
   match    
0   NaN
1   0   1A1619
    1   A124
    2   BL171111
2   0   Z112
3   NaN
4   0   1A25629Q88
    1   A13B
    2   A1
    3   A16

How should I change my code?

1 answer:

Answer 0 (score: 1)

Given that s is the return value of df['Date'].str.extractall(pat), we can:

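import numpy as np
s = df['Date'].str.extractall(pat)
i = df.index.difference(s.index.get_level_values(0))  # row labels with no match
o = pd.DataFrame({0: np.nan}, index=[i, [0]*len(i)])  # one NaN row per such label, at match 0
adjust = lambda s,o: pd.concat([s, o]).sort_index()  # merge, then restore row order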

Then

>>> adjust(s,o)

                  0
  match            
0 0             NaN
1 0          1A1619
  1            A124
  2        BL171111
2 0            Z112
3 0             NaN
4 0      1A25629Q88
  1            A13B
  2              A1
  3             A16
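
For convenience, the steps above can be collected into one self-contained sketch. The helper name extractall_with_nan and the variable names missing and filler are hypothetical and not part of pandas or of the answer; the filler frame is built with dtype=object on the assumption that this keeps the concatenation from mixing dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['nothing ',
                            'This 1A1619 A124 person BL171111 the A-1-24 and ',
                            'dont Z112 but NOT 12-24-1981',
                            'nada here either',
                            'mix: 1A25629Q88 or A13B ok A1 the A16'],
                   'IDs': ['A11', 'B22', 'C33', 'D44', 'E55']})

pat = r'((?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S))'


def extractall_with_nan(series, pattern):
    """Hypothetical helper: extractall, plus one NaN row per unmatched label."""
    s = series.str.extractall(pattern)
    # Labels of the original rows that produced no match at all.
    missing = series.index.difference(s.index.get_level_values(0))
    # One filler row per missing label, placed at match number 0.
    filler_index = pd.MultiIndex.from_arrays(
        [missing, [0] * len(missing)], names=s.index.names)
    filler = pd.DataFrame(np.nan, index=filler_index,
                          columns=s.columns, dtype=object)
    # Combine matches and fillers, then sort back into row order.
    return pd.concat([s, filler]).sort_index()


result = extractall_with_nan(df['Date'], pat)
print(result)
```

Rows 0 and 3 of df contain no match, so they are the ones that receive a NaN entry; every other row keeps its matches unchanged.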