Question

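I have a pandas DataFrame with 3 columns ['a', 'b', 'c']. I want to apply a function to the whole DataFrame based on several conditions and flag the rows, so that I end up with 4 new columns in the DataFrame. I have the code below, but it does not work. The error I get is: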

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The code is:

if df['a'] is pd.NaT:
    df['is_open'] = df['c']
elif df['b']=='04' or df['b']=='14':
    df['is_wo'] = df['c']
elif (df['b']!='05') and (df['a'] is not pd.NaT):
    df['is_payment'] = df['c']
else:
    df['is_correction'] =  df['c']

Do you know how I can apply these conditions? Note that the order of the conditions matters.

I came up with this solution, but it is slow on large DataFrames:

def get_open_debt_outcome(row):
    if row['a'] is pd.NaT:
        return row['c']
    else:
        return np.nan

def get_wo_outcome(row):
    if pd.isna(row['is_open'])  and (row['b']=='04' or row['b']=='14'):
        return row['c']
    else:
        return np.nan

def get_payment_outcome(row):
    if pd.isna(row['is_open']) and pd.isna(row['is_wo']) and (row['b']!='05') and (row['a'] is not pd.NaT):
        return row['c']
    else:
        return np.nan

def get_correction_outcome(row):
    if pd.isna(row['is_open']) and pd.isna(row['is_wo']) and pd.isna(row['is_payment']):
        return row['c']
    else:
        return np.nan


df['is_open'] = df.apply(lambda x: get_open_debt_outcome(x), axis=1)
df['is_wo'] = df.apply(lambda x: get_wo_outcome(x), axis=1)
df['is_payment'] = df.apply(lambda x: get_payment_outcome(x), axis=1)
df['is_correction'] = df.apply(lambda x: get_correction_outcome(x), axis=1)

Solution, based on @blacksite's answer:

mask = df['a'].isnull()
df['is_open'] = np.where(mask, df['c'], np.nan)

mask = (
    df['is_open'].isnull() &
    ((df['b'] == '04') | (df['b'] == '14'))
)
df['is_wo'] = np.where(mask, df['c'], np.nan)

mask = (
    df['is_open'].isnull() &
    df['is_wo'].isnull() &
    (df['b'] != '05') &
    df['a'].notnull()
)

df['is_payment'] = np.where(mask, df['c'], np.nan)

mask = (
    df['is_open'].isnull() &
    df['is_wo'].isnull() &
    df['is_payment'].isnull()
)

df['is_correction'] = np.where(mask, df['c'], np.nan)
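For context on why the original if/elif version raised the ValueError while the mask version does not, here is a minimal sketch on a made-up Series `b` (the sample values are an assumption for illustration). Python's `if`, `and` and `or` need a single True/False, whereas a comparison on a Series yields one boolean per row. Note that `df['a'] is pd.NaT` is an identity check on the Series object itself, so it is simply always False; the error most likely comes from the `elif` line that uses `or`.

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
b = pd.Series(['04', '05', '14'])

# `or` tries to coerce each Series to one bool, which is ambiguous and raises.
try:
    if (b == '04') or (b == '14'):
        pass
    raised = False
except ValueError:
    raised = True

# The element-wise operators | and & combine the per-row booleans instead.
mask = (b == '04') | (b == '14')

# isin is an equivalent shorthand for "equals any of these values".
same_mask = b.isin(['04', '14'])
```

The parentheses around each comparison matter, because `|` and `&` bind more tightly than `==` and `!=`.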

Answer 1

Here is an example of how to get the 'is_wo' column. The rest are very similar:

import numpy as np

# True-False indexing. Vectorized, so much faster than element-wise.
mask = (
    df['is_open'].isnull() &
    ((df['b'] == '04') | (df['b'] == '14'))
)
# numpy.where is basically an ifelse statement, taking a boolean vector as the first argument, and the desired values for true and false as the second and third arguments
df['is_wo'] = np.where(mask, df['c'], np.nan)
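To make "the rest are very similar" concrete, here is one possible sketch of all four columns. It is not the answer's exact code: instead of testing whether the earlier output columns are null, it keeps the boolean masks themselves and negates them, which avoids misreading a row where 'c' is itself NaN. The sample DataFrame is made up for illustration, with one row per branch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row for each branch of the if/elif chain.
df = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', None, '2020-01-03', '2020-01-04']),
    'b': ['04', '01', '01', '05'],
    'c': [10.0, 20.0, 30.0, 40.0],
})

# Each mask excludes rows already claimed by an earlier branch,
# which reproduces the ordering of if / elif / else.
is_open = df['a'].isnull()
is_wo = ~is_open & df['b'].isin(['04', '14'])
is_payment = ~is_open & ~is_wo & (df['b'] != '05') & df['a'].notnull()
is_correction = ~is_open & ~is_wo & ~is_payment

masks = {
    'is_open': is_open,
    'is_wo': is_wo,
    'is_payment': is_payment,
    'is_correction': is_correction,
}
for name, mask in masks.items():
    # Copy 'c' where the mask is True, NaN elsewhere.
    df[name] = np.where(mask, df['c'], np.nan)
```

With this data, each row ends up with exactly one of the four new columns populated.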

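pandas.DataFrame.apply is usually slow: with axis=1 it calls a Python function once per row instead of operating on whole columns at once.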
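As another vectorized alternative to apply, numpy.select may be worth considering, since it picks the first matching condition per row and so mirrors the order of an if/elif chain directly. The following is only a sketch under assumptions: the sample data and the intermediate `outcome` array are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row for each branch of the if/elif chain.
df = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', None, '2020-01-03', '2020-01-04']),
    'b': ['04', '01', '01', '05'],
    'c': [10.0, 20.0, 30.0, 40.0],
})

# Conditions are listed in priority order; np.select uses the first
# one that is True for each row and falls back to `default` otherwise.
conditions = [
    df['a'].isnull(),
    df['b'].isin(['04', '14']),
    (df['b'] != '05') & df['a'].notnull(),
]
labels = ['is_open', 'is_wo', 'is_payment']
outcome = np.select(conditions, labels, default='is_correction')

# Spread 'c' into one column per label; non-matching rows become NaN.
for name in labels + ['is_correction']:
    df[name] = df['c'].where(outcome == name)
```

Because the ordering lives in the list of conditions, none of the masks needs to refer to the columns created before it.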

Applying several conditions to a pandas DataFrame
