Apply a regular expression to create new columns: isdigit() vs isnumeric()

Asked: 2019-08-04 11:06:44

Tags: python regex python-3.x pandas dataframe

I have a data frame like the one below:

data_file= pd.DataFrame({'pid':[1,1.5,6.557657,'ABCD','1+','TRACE']})

I want to create two new columns, value_as_number and value_as_string.

This is what I tried:

value_as_string = data_file['pid'].str.extract(r'(\D+)')  # this chops off the `1` from `1+`, which isn't expected

value_as_number = ~data_file['pid'].str.extract(r'(\D+)')  # results in the error shown below

TypeError: bad operand type for unary ~: 'float'

I also tried the following, but it did not help either:

data_file['pid'].str.isnumeric()
data_file['pid'].str.isdigit()

I expect one column for the numbers (e.g. 1, 2, 1.5, 4.5) and a separate column for values that mix digits, letters and symbols (e.g. 1+, ABCD, TEST).

4 answers:

Answer 0 (score: 3)

You can use pd.to_numeric together with Series.where:

data_file['num'] = pd.to_numeric(data_file['pid'],errors='coerce')

data_file['alpha'] = data_file['pid'].where(data_file['num'].isnull())

       pid       num  alpha
0        1  1.000000    NaN
1      1.5  1.500000    NaN
2  6.55766  6.557657    NaN
3     ABCD       NaN   ABCD
4       1+       NaN     1+
5    TRACE       NaN  TRACE

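Finally, you can call fillna('') on the string column, but do not use it on the numeric column, because filling a float column with '' turns it into an object column.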
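
Putting the steps of this answer together with the column names from the question, a minimal sketch might look like the following. The intermediate variable `num` is introduced only for this illustration.

```python
import pandas as pd

data_file = pd.DataFrame({'pid': [1, 1.5, 6.557657, 'ABCD', '1+', 'TRACE']})

# Anything that cannot be parsed as a number becomes NaN.
num = pd.to_numeric(data_file['pid'], errors='coerce')

# Keep the numeric column as floats (no fillna here).
data_file['value_as_number'] = num

# Keep the original value only where parsing failed, then blank out the rest.
data_file['value_as_string'] = data_file['pid'].where(num.isna()).fillna('')

print(data_file)
print(data_file.dtypes)
```

Note that pd.to_numeric also parses numeric strings such as '2.5', so a string like that would land in the numeric column under this approach.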

Answer 1 (score: 2)

You do not need a regular expression. The following code gives you what you want, but both new columns will have object dtype.

import pandas as pd

data_file = pd.DataFrame({'pid':[1,1.5,6.557657,'ABCD','1+','TRACE']})
data_file['numbers'] = data_file['pid'].map(lambda x: x if type(x) in [int, float] else '')
data_file['strings'] = data_file['pid'].map(lambda s: s if type(s) is str else '')

Here is the output:

        pid  numbers strings
0        1        1
1      1.5      1.5
2  6.55766  6.55766
3     ABCD             ABCD
4       1+               1+
5    TRACE            TRACE
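
To illustrate the remark about object columns, here is a small sketch comparing the empty-string filler used above with a NaN filler. The names `with_blank` and `with_nan` are made up for this example, and using NaN is a variation on the answer, not part of it.

```python
import numpy as np
import pandas as pd

data_file = pd.DataFrame({'pid': [1, 1.5, 6.557657, 'ABCD', '1+', 'TRACE']})

# Empty strings mixed with numbers force the column to stay object dtype.
with_blank = data_file['pid'].map(lambda x: x if isinstance(x, (int, float)) else '')

# Using NaN as the filler lets pandas infer a float column instead.
with_nan = data_file['pid'].map(lambda x: x if isinstance(x, (int, float)) else np.nan)

print(with_blank.dtype)
print(with_nan.dtype)
```

If the numeric column is going to be used in arithmetic or aggregation, the NaN variant is likely the more convenient of the two.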

Answer 2 (score: 1)

If you need to tell numeric values apart from strings in a mixed-type column, use isinstance:

data_file= pd.DataFrame({'pid':[1,1.5,6.557657,'ABCD','1+','TRACE']})

mask = data_file['pid'].apply(lambda x: isinstance(x, (float, int)))

data_file['value_as_number'] = data_file['pid'].where(mask)
data_file['value_as_string'] = data_file['pid'].mask(mask)
print(data_file)
       pid value_as_number value_as_string
0        1               1             NaN
1      1.5             1.5             NaN
2  6.55766         6.55766             NaN
3     ABCD             NaN            ABCD
4       1+             NaN              1+
5    TRACE             NaN           TRACE
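
As a variation on the isinstance check, the sketch below uses numbers.Number so that other numeric scalar types are also recognised, and excludes bool explicitly. The helper name `is_number` and the final astype(float) are assumptions added for this illustration, not part of the answer.

```python
import numbers

import pandas as pd

data_file = pd.DataFrame({'pid': [1, 1.5, 6.557657, 'ABCD', '1+', 'TRACE']})


def is_number(x):
    # bool is a subclass of int, so it is excluded explicitly here.
    return isinstance(x, numbers.Number) and not isinstance(x, bool)


mask = data_file['pid'].apply(is_number)

# where() keeps values where the mask is True; mask() keeps the opposite.
data_file['value_as_number'] = data_file['pid'].where(mask).astype(float)
data_file['value_as_string'] = data_file['pid'].mask(mask)

print(data_file)
```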

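If all the values are strings, one possible solution is to pass a pattern that matches integers and floats to Series.str.contains:

mask = data_file['pid'].astype(str).str.contains(r'^\d+$|^\d+\.\d+$')

Or use a custom function that tests whether a value can be parsed as a number: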

def test(x):
    try:
        float(x)
        return True
    except Exception:
        return False

mask = data_file['pid'].apply(test)
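
The two string-based checks above do not accept exactly the same inputs. The sketch below runs both on a few made-up sample strings (the `values` series and the `is_float` helper are illustrative only) to show where they differ: float() also accepts signs, exponents and special values such as 'nan', while the pattern only accepts plain digits with an optional decimal part.

```python
import pandas as pd

values = pd.Series(['1', '1.5', '-2', '1e3', 'nan', '1+', 'ABCD'])

# Pattern from the answer: digits only, or digits.digits.
regex_mask = values.str.contains(r'^\d+$|^\d+\.\d+$')


def is_float(x):
    # Same idea as test() above, with a narrower except clause.
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False


float_mask = values.apply(is_float)

print(pd.DataFrame({'value': values, 'regex': regex_mask, 'float()': float_mask}))
```

Which one is appropriate depends on whether inputs like '-2' or '1e3' should count as numbers in your data.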

Answer 3 (score: 1)

Use str.replace and str.isnumeric:

m1 = data_file['pid'].astype(str).str.replace('.', '', n=1, regex=False).str.isnumeric()
m2 = ~m1

data_file['value_as_number'] = data_file['pid'].where(m1)
data_file['value_as_string'] = data_file['pid'].where(m2)

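Output (this sample includes two extra rows, 1.212.333 and 1....1, which are not in the question's data):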

         pid value_as_number value_as_string
0          1               1             NaN
1        1.5             1.5             NaN
2    6.55766         6.55766             NaN
3       ABCD             NaN            ABCD
4         1+             NaN              1+
5      TRACE             NaN           TRACE
6  1.212.333             NaN       1.212.333
7     1....1             NaN          1....1
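
On the isdigit() vs isnumeric() part of the question, a small comparison may help. The sample strings below are made up for illustration, and the series is created with dtype=object so that pandas falls back to Python's own str methods.

```python
import pandas as pd

samples = pd.Series(['1', '1.5', '-1', '²', '½', '1+'], dtype=object)

comparison = pd.DataFrame({
    'value': samples,
    'isdecimal': samples.str.isdecimal(),
    'isdigit': samples.str.isdigit(),
    'isnumeric': samples.str.isnumeric(),
})

print(comparison)
```

None of the three methods accepts a decimal point or a minus sign, which is why this answer strips one dot before calling isnumeric. The methods differ only on characters such as superscripts and fractions: '²' passes isdigit and isnumeric, while '½' passes only isnumeric.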