Question

This is what my DataFrame looks like:
mydf =

   col1   Col2  Col3                              Col4
0  val1   1x    \n\t\t\t\t\t\t3x\n\t\t\t\t\t      Calculate
1  val2   1x    \n\t\t\t\t\t\t3x\n\t\t\t\t\t      Calculate
2  val3   1x    \n\t\t\t\t\t\t12.5x\n\t\t\t\t\t   Calculated
3  val4   1x    \n\t\t\t\t\t\t8x\n\t\t\t\t\t      Calculated
4  val5   1x    \n\t\t\t\t\t\t10x\n\t\t\t\t\t     Calculate
5  val18  1x    \n\t\t\t\t\t\t6.3x\n\t\t\t\t\t    Calculate

I want to extract the numbers from Col4, including any decimal part.

However, my regex pattern is not working for me.

mydf['Col4'].str.extract(r'[1-9]\d*(\.\d+)?')

For most rows it returns NaN, and for the rows with decimals it returns .5 / .3 (i.e., only the decimal part).

I tried checking my pattern with re.search, and it works.

newstr = mydf['Col4'][5]
re.search(r'[1-9]\d*(\.\d+)?', newstr)

newstr is '\n\t\t\t\t\t\t\t12.5x\n\t\t\t\t\t\t' (with doubled backslashes). The call above returns

<re.Match object; span=(14, 18), match='12.5'>

which is what I expected.

It seems I am missing something obvious.

Answer 1

Use str.findall:

df.Col3.str.findall(r'[-+]?\d*\.\d+|\d+').str[0]  # note: this also extracts a leading sign (on decimal values)
0       3
1       3
2    12.5
3       8
4      10
5     6.3
Name: Col3, dtype: object
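
As for why the original attempt behaved that way: str.extract returns the contents of the capture groups rather than the whole match, and the only group in the pattern is the optional (\.\d+). re.search appears to work because the repr of the match object shows the full match, i.e. group(0). Here is a minimal sketch of this, using a made-up three-row Series (the names s, only_decimals and whole are illustrative, not from the question):

```python
import re
import pandas as pd

# Hypothetical sample mirroring the shape of the values in the question
s = pd.Series([
    "\n\t\t\t\t\t\t3x\n\t\t\t\t\t",
    "\n\t\t\t\t\t\t12.5x\n\t\t\t\t\t",
    "\n\t\t\t\t\t\t6.3x\n\t\t\t\t\t",
], name="Col3")

# re.search: group(0) is the whole match, group(1) is only the captured part
m = re.search(r'[1-9]\d*(\.\d+)?', s[1])
full_match, captured = m.group(0), m.group(1)  # '12.5' and '.5'

# Original pattern: str.extract returns only group 1, so NaN when there is no decimal part
only_decimals = s.str.extract(r'[1-9]\d*(\.\d+)?')

# Wrap the whole number in a capture group and make the inner group non-capturing
whole = s.str.extract(r'([1-9]\d*(?:\.\d+)?)', expand=False)
print(whole.tolist())  # ['3', '12.5', '6.3']
```

With expand=False and a single capture group, the result is a Series rather than a one-column DataFrame.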

Answer 2

It looks like you could also just strip and avoid regex altogether:

df.Col3.str.strip().str[:-1]

0       3
1       3
2    12.5
3       8
4      10
5     6.3
Name: Col3, dtype: object
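
Note that the result is still strings (dtype: object). If actual numbers are needed, one option is pd.to_numeric. This is a minimal sketch, assuming every value ends in exactly one trailing "x" once the whitespace is stripped; the sample Series and the name numbers are illustrative only:

```python
import pandas as pd

# Hypothetical sample mirroring the shape of the values in the question
s = pd.Series([
    "\n\t\t\t\t\t\t3x\n\t\t\t\t\t",
    "\n\t\t\t\t\t\t12.5x\n\t\t\t\t\t",
    "\n\t\t\t\t\t\t6.3x\n\t\t\t\t\t",
], name="Col3")

# Strip surrounding whitespace, drop the trailing 'x', then convert to numbers
numbers = pd.to_numeric(s.str.strip().str[:-1])
print(numbers.tolist())  # [3.0, 12.5, 6.3]
```

If some rows might not follow that format, passing errors='coerce' to pd.to_numeric turns the unparseable ones into NaN instead of raising.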

Pandas Series str.extract not matching regex pattern