Question

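I often need a new column that is the best I can derive from other columns, and I have a specific priority list of preferences. I am willing to take the first non-null value.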

import pandas as pd

def coalesce(values):
    # Return the first value that is not None, or None if every value is None.
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

# The column order is fixed explicitly so that rows iterate as first, second, third.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print(df)

Result

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

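This code works (and the results are what I want), but it is not very fast. I get to pick my priorities if I need to, for example [['second','third','first']].

Coalesce is somewhat like the function of the same name in T-SQL. I suspect that I may have overlooked an easy way to achieve it with good performance on large DataFrames (400,000+ rows).

I know there are lots of ways to fill in missing data, which I often use on axis=0. That is what makes me think I may have missed an easy option for axis=1.

Can you suggest something nicer/faster, or confirm that this is as good as it gets?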

Answer 1

The Pandas equivalent to COALESCE is the method fillna():

result = column_a.fillna(column_b)

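The result is a column where each value is taken from column_a if that column provides a non-null value; otherwise the value is taken from column_b. So your combo1 can be produced with: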

df['first'].fillna(df['second']).fillna(df['third'])

This yields the values A, C, B, None, A, matching the combo1 column in the question.

Your combo2 can be produced with:

(df['second']).fillna(df['third']).fillna(df['first'])

This returns a new column with the values C, C, B, None, B, matching the combo2 column in the question.
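As a quick check, the two fillna() chains can be run against the question's data. This is a minimal sketch: the explicit `columns=` argument and the `.tolist()` calls are additions for a compact display and are not part of the original answer.

```python
import pandas as pd

# Same data as in the question, with the column order made explicit.
df = pd.DataFrame(
    [{'third': 'B', 'first': 'A', 'second': 'C'},
     {'third': 'B', 'first': None, 'second': 'C'},
     {'third': 'B', 'first': None, 'second': None},
     {'third': None, 'first': None, 'second': None},
     {'third': 'B', 'first': 'A', 'second': None}],
    columns=['first', 'second', 'third'])

# Each fillna() only fills positions that are still null,
# so the leftmost column in the chain has the highest priority.
combo1 = df['first'].fillna(df['second']).fillna(df['third'])
combo2 = df['second'].fillna(df['third']).fillna(df['first'])

print(combo1.tolist())
print(combo2.tolist())
```

Row 3 stays null in both results because no column provides a value for it.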

If you want an efficient operation called coalesce, it can simply combine the columns with fillna() from left to right and then return the result:

def coalesce(df, column_names):
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print(coalesce(df, ['first', 'second', 'third']))
print(coalesce(df, ['second', 'third', 'first']))

Both calls print the same columns as the chained fillna() expressions above.
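The same left-to-right fold can also be written with functools.reduce. The sketch below is only an alternative phrasing of the loop above; the name `coalesce_reduce` and the small three-row DataFrame are hypothetical and chosen for illustration.

```python
from functools import reduce

import pandas as pd


def coalesce_reduce(df, column_names):
    """Hypothetical variant of coalesce(): fold fillna() over the columns."""
    columns = [df[name] for name in column_names]
    # reduce() applies fillna() left to right, so earlier names win.
    return reduce(lambda acc, col: acc.fillna(col), columns)


# Small illustrative frame (not the question's data).
df = pd.DataFrame({'first': ['A', None, None],
                   'second': ['C', 'C', None],
                   'third': ['B', 'B', 'B']})

result = coalesce_reduce(df, ['second', 'third', 'first'])
print(result.tolist())
```

Like the loop version, this needs at least one column name; reduce() raises a TypeError on an empty list.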

Answer 2

You could use pd.isnull to find the null (in this case None) values:

In [169]: pd.isnull(df)
Out[169]:
   first  second  third
0  False   False  False
1   True   False  False
2   True    True  False
3   True    True   True
4  False    True  False

and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:

In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])

Then you could select the desired values from df using NumPy integer indexing:

In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,

import numpy as np
import pandas as pd

# The integer positions used below assume the column order first, second, third.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
# Priority order for combo2: second, third, first.
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
print(df)

yields

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

Using argmin instead of df.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:

df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
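The argmin steps can be wrapped in a function that takes the priority order as column names rather than integer positions. This is a sketch under assumptions: `coalesce_argmin` is a hypothetical name, and selecting the columns in priority order before building the mask is a simplification that replaces the `order` index array used above.

```python
import numpy as np
import pandas as pd


def coalesce_argmin(df, column_names):
    """Hypothetical helper: first non-null value per row, in priority order."""
    # Select the columns in priority order and work on a plain object array.
    values = df[list(column_names)].to_numpy(dtype=object)
    mask = pd.isnull(values)
    # argmin on a boolean array returns the position of the first False,
    # i.e. the first non-null entry. An all-null row yields position 0,
    # whose value is itself null.
    first_valid = np.argmin(mask, axis=1)
    picked = values[np.arange(len(df)), first_valid]
    return pd.Series(picked, index=df.index)


# Small illustrative frame (not the question's data).
df = pd.DataFrame({'first': ['A', None, None, None],
                   'second': ['C', 'C', None, None],
                   'third': ['B', 'B', 'B', None]})

combo = coalesce_argmin(df, ['second', 'third', 'first'])
print(combo.tolist())
```

Because the columns are selected by name, the result does not depend on the order in which the DataFrame happens to store them.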

Answer 3

df1 = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},                   
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

df1['combo'] = df1[['second','third','first']].bfill(axis='columns')['second']
print(df1)

Result

  third first second combo
0     B     A      C     C
1     B  None      C     C
2     B  None   None     B
3  None  None   None  None
4     B     A   None     B
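The bfill line above works because back-filling along the columns pulls the first non-null value of each row into the leftmost selected column. A minimal sketch of the same idea as a reusable function follows; `coalesce_bfill` is a hypothetical name, and taking the leftmost column with `.iloc[:, 0]` instead of naming it is an assumption made so the caller only has to supply the priority list.

```python
import pandas as pd


def coalesce_bfill(df, column_names):
    """Hypothetical helper wrapping the bfill approach."""
    # After back-filling along the columns, the leftmost column holds
    # the first non-null value of each row (or null if there is none).
    return df[list(column_names)].bfill(axis='columns').iloc[:, 0]


df1 = pd.DataFrame([{'third': 'B', 'first': 'A', 'second': 'C'},
                    {'third': 'B', 'first': None, 'second': 'C'},
                    {'third': 'B', 'first': None, 'second': None},
                    {'third': None, 'first': None, 'second': None},
                    {'third': 'B', 'first': 'A', 'second': None}])

combo = coalesce_bfill(df1, ['second', 'third', 'first'])
print(combo.tolist())
```

The returned Series keeps the name of the first column in the priority list, so it may need renaming before being assigned back.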

Is there a better, more readable way to coalesce columns in pandas?
