Boolean equivalent of pandas to_numeric()

Time: 2017-05-04 17:48:46

Tags: python pandas dataframe etl petl

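I am looking for a boolean equivalent of pandas to_numeric(). I want a function that converts a column to True / False / nan where possible and raises an error where it is not.

My motivation is that I need to automatically identify and convert the boolean columns in a dataset with roughly 1000 columns. I can do something similar for floats/integers with the following code: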

df = df_raw.apply(pd.to_numeric, errors='ignore')
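
For reference, the behaviour asked for here could be sketched roughly as follows. This is only an illustration under assumptions: to_bool, convert_bool_columns and BOOL_STRINGS are hypothetical names (pandas does not ship such a function), and the set of accepted string spellings is a guess that should be adapted to the data.

```python
import numpy as np
import pandas as pd

# Assumption: only these two spellings count as boolean strings.
BOOL_STRINGS = {"True": True, "False": False}

def to_bool(series, errors="raise"):
    """Hypothetical helper: map a Series to True / False / NaN.

    errors="raise" raises ValueError on the first unparseable value,
    errors="coerce" turns unparseable values into NaN.
    """
    def convert(x):
        if isinstance(x, (bool, np.bool_)):
            return bool(x)
        if isinstance(x, str) and x in BOOL_STRINGS:
            return BOOL_STRINGS[x]
        if pd.isna(x):
            return np.nan
        if errors == "coerce":
            return np.nan
        raise ValueError(f"Unable to parse {x!r} as a boolean")

    values = [convert(x) for x in series]
    return pd.Series(values, index=series.index, dtype=object, name=series.name)

def convert_bool_columns(df):
    """Convert every column that parses cleanly; leave the others untouched."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = to_bool(out[col])
        except ValueError:
            pass  # not a boolean column
    return out

df_raw = pd.DataFrame({
    "a": ["True", "False", None],
    "b": ["1", "2", "3"],
    "c": [True, False, True],
})
df = convert_bool_columns(df_raw)
print(df)
```

Columns "a" and "c" are converted, while "b" is left as it was because "1" is not in the accepted spellings.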

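4 answers:

Answer 0: (score: 6)

Since pd.to_numeric is mainly used to convert strings to numeric values, I am assuming you want to convert strings that represent boolean literals.

Consider the dataframe df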

df = pd.DataFrame([
        ['1', None, 'True'],
        ['False', 2, True]
    ])

print(df)

       0    1     2
0      1  NaN  True
1  False  2.0  True

My choice
This is what I would propose. Further below, I break it down to explain what is going on.

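from ast import literal_eval
import numpy as np
import pandas as pd

def try_eval2(x):
    # Only strings are parsed; a string that fails to parse becomes NaN
    if type(x) is str:
        try:
            x = literal_eval(x)
        except Exception:
            x = np.nan

    # Keep genuine booleans only; everything else becomes NaN
    if type(x) is not bool:
        x = np.nan

    return x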

vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)

       0    1     2
0    NaN  NaN  True
1  False  NaN  True

Timing
You will notice that my proposed solution is quite fast

%%timeit
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
10000 loops, best of 3: 149 µs per loop

%timeit df.astype(str).applymap(to_boolean)
1000 loops, best of 3: 1.28 ms per loop

%timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1000 loops, best of 3: 1.27 ms per loop

Explanation

Step 1
First, I use ast.literal_eval to create a simple function that converts strings to values

from ast import literal_eval

def try_eval(x):
    try:
        x = literal_eval(x)
    except:
        pass
    return x
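
To see why the function needs the try/except, it may help to look at what literal_eval does with a few sample inputs. The snippet below is only an illustrative sketch; the sample list is made up, and the except clause names the exception types explicitly instead of using a bare except.

```python
from ast import literal_eval

# Made-up sample inputs: boolean strings, numeric strings, a plain word,
# and two non-string values.
samples = ["True", "False", "1", "2.5", "s", None, 2]

results = []
for value in samples:
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        parsed = value  # not a parseable literal: keep the input unchanged
    results.append(parsed)
    print(repr(value), "->", repr(parsed), type(parsed).__name__)
```

Strings holding Python literals come back as real values, while anything that is not a parseable literal string (including non-string inputs such as None or 2) is passed through unchanged.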

Step 2
applymap my new function. It will look the same!

d1 = df.applymap(try_eval)
print(d1)

       0    1     2
0      1  NaN  True
1  False  2.0  True

第3步
Use where together with applymap again to keep only the values that are actually bool
d2 = d1.where(d1.applymap(type).eq(bool))
print(d2)

       0   1     2
0    NaN NaN  True
1  False NaN  True
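
On newer pandas releases, DataFrame.applymap is deprecated in favour of DataFrame.map (since pandas 2.1). A version-tolerant variant of this step might look like the sketch below; elementwise is a hypothetical helper name, and the mask is built with isinstance rather than by comparing type objects.

```python
import numpy as np
import pandas as pd

def elementwise(df, func):
    """Hypothetical helper: apply func to every cell.

    Uses DataFrame.map where it exists and falls back to applymap otherwise.
    """
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)

# Same values as d1 above, after the strings have been evaluated
d1 = pd.DataFrame([[1, np.nan, True], [False, 2.0, True]], dtype=object)

# True where the cell holds a genuine Python bool
mask = elementwise(d1, lambda x: isinstance(x, bool))

# Keep the booleans and replace everything else with NaN
d2 = d1.where(mask)
print(d2)
```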

Step 4
You can drop the columns that contain only NaN

print(d2.dropna(axis=1, how='all'))

       0     2
0    NaN  True
1  False  True

Answer 1: (score: 5)

You need replace together with where, which replaces all non-boolean values with NaN:

df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type) == bool)

Old solution (very slow):

If there are some boolean values in df, you can use astype to convert to strings, then applymap with a custom function that uses ast.literal_eval for the conversion:

from ast import literal_eval

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x

print (df.astype(str).applymap(to_boolean))
#with borrowing sample from piRSquared
       0   1     2
0    NaN NaN  True
1  False NaN  True

Timings

df = pd.DataFrame([
        ['True', False, '1', 0, None, 5.2],
        ['False', True, '0', 1, 's', np.nan]])

#[20000 rows x 60 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df = pd.concat([df]*10, axis=1).reset_index(drop=True)
df.columns = pd.RangeIndex(len(df.columns))
#print (df)

In [76]: %timeit (jez(df))
1 loop, best of 3: 488 ms per loop

In [77]: %timeit (jez2(df))
1 loop, best of 3: 527 ms per loop

#piRSquared fastest solution
In [78]: %timeit (pir(df))
1 loop, best of 3: 5.42 s per loop

#maxu solution
In [79]: %timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1 loop, best of 3: 1.88 s per loop

#jezrael old solution
In [80]: %timeit df.astype(str).applymap(to_boolean)
1 loop, best of 3: 13.3 s per loop

Code for timings

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x


def try_eval2(x):
    if type(x) is str:
        try:
            x = literal_eval(x)
        except:
            x = np.nan

    if type(x) is not bool:
        x = np.nan

    return x

def pir(df):
    vals = df.values
    v = vals.ravel()
    a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
    df2 = pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
    return (df2)

def jez(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type) == bool)
    return (df)

def jez2(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type).eq(bool))
    return (df)

Answer 2: (score: 3)

astype is a more specific version of pd.to_numeric:

df = df_raw.astype('bool')
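
One thing worth checking before relying on this: on object columns, astype('bool') follows Python truthiness, so it does not parse the text of a string. The small sketch below uses made-up sample values to show the effect.

```python
import pandas as pd

# Made-up sample: two boolean-looking strings, an empty string and "0"
s = pd.Series(["True", "False", "", "0"], dtype=object)

converted = s.astype("bool")
# Non-empty strings, including "False" and "0", come out as True;
# only the empty string comes out as False.
print(converted.tolist())
```

For string data like that in the question, this means 'False' would silently become True and no error would be raised for non-boolean values.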

Answer 3: (score: 3)

I used @piRSquared's sample DF:

In [39]: df
Out[39]:
       0    1     2
0      1  NaN  True
1  False  2.0  True

In [40]: df.astype(str).stack().map({'True':True, 'False':False}).unstack()
Out[40]:
       0    1     2
0    NaN  NaN  True
1  False  NaN  True
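
The one-liner above can be read as four separate steps. The sketch below only unpacks it for illustration; the intermediate variable names are made up.

```python
import pandas as pd

df = pd.DataFrame([
    ["1", None, "True"],
    ["False", 2, True],
])

# 1. Make every cell a string, so that True and 'True' look the same
as_text = df.astype(str)

# 2. Reshape to one long Series indexed by (row, column)
long = as_text.stack()

# 3. Look each string up in the dict; anything not listed becomes NaN
mapped = long.map({"True": True, "False": False})

# 4. Reshape back to the original rows and columns
result = mapped.unstack()
print(result)
```

Because the dict lookup is exact, only the spellings 'True' and 'False' are recognised; other spellings would have to be added to the dict.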