Question

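I have a DataFrame with several columns, and I need to populate various columns based on a condition. I wrote a function and have been using df.apply, but it is obviously very slow. I am looking for help creating a faster way to do the following: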

def function(df):
    if pd.isnull(df['criteria_column']) == True:
        return df['return_column']
    else:
        return
df['new_column'] = df.apply(function, axis=1)

I would like to do something like this:

 df['new_column'] = np.where(pd.isnull(df['criteria_column'] == True),
                                       df['return_column'], "")

However, this raises ValueError: Could not construct Timestamp from argument <type 'bool'>

Answer 1

Use indexing instead of apply; it is much faster:

df["new_column"] = ""
is_null = pd.isnull(df["criteria_column"])
df["new_column"][is_null] = df["return_column"][is_null] # method 1

For reference, here are a few ways to do the same thing as the last line:

df["new_column"][is_null] = df["return_column"][is_null] # method 1
df["new_column"].loc[is_null] = df["return_column"].loc[is_null] # method 2
df.loc[is_null, "new_column"] = df.loc[is_null, "return_column"] # method 3, thanks @joris
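
As a minimal sketch of method 3 on a small frame: the sample values below are hypothetical and only the column names come from the question. Method 3 is used here because chained assignment, as in methods 1 and 2, can emit SettingWithCopyWarning and in recent pandas versions with copy-on-write it may not update df at all.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "criteria_column": pd.to_datetime(["2014-01-01", None, "2014-03-01", None]),
    "return_column": ["a", "b", "c", "d"],
})

df["new_column"] = ""                    # default for rows that fail the test
is_null = df["criteria_column"].isna()   # boolean mask, one entry per row

# Method 3: one .loc call selects both the rows and the column to assign.
df.loc[is_null, "new_column"] = df.loc[is_null, "return_column"]

print(df["new_column"].tolist())
```

Rows whose criteria_column is missing receive the value from return_column; every other row keeps the empty string.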

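For the curious, methods 1 and 2 access the columns as pandas.Series objects and perform the selective assignment on them. Note in particular that series[is_null] ends up calling series.loc[is_null] in this instance.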

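Finally, method 3 is a convenience form of method 2 that removes possible ambiguity, reduces memory use, and allows assignment after successive chaining. If you are doing complex chained selections and either do not want intermediate copies or want to assign to the selection, this method is probably the better choice. See the pandas documentation.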
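
The np.where form from the question can also work. The sketch below assumes the ValueError came from the parenthesis placement: in the question's attempt, df['criteria_column'] == True is evaluated before pd.isnull runs, so a datetime column is compared with a bool. That diagnosis is an inference from the error message, and the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "criteria_column": pd.to_datetime(["2014-01-01", None, "2014-03-01", None]),
    "return_column": ["a", "b", "c", "d"],
})

# Build the mask from the column itself; no comparison with True is needed.
mask = df["criteria_column"].isna()

# np.where picks return_column where the mask is True and "" elsewhere.
df["new_column"] = np.where(mask, df["return_column"], "")

print(df["new_column"].tolist())
```

This computes the whole column in one vectorized pass, just like the .loc assignment, so either form avoids the row-by-row cost of df.apply.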

Using np.where or other broadcasting techniques in a pandas DataFrame
