Question

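I have a sample DataFrame, shown below. For each row, I want to check c1 first; if it is null, then check c2, and so on. In this way, I find the first non-null column and store its value in the result column.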

ID  c1  c2  c3  c4  result
1   a   b           a
2       cc  dd      cc
3           ee  ff  ee
4               gg  gg

This is how I am doing it now, but I would like to know whether there is a better method. (The column names do not follow any pattern; this is just a sample.)

df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown")

This approach does not look good when there are many columns.

Answer 1

First back-fill the NaNs along each row, then select the first column with iloc:

df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

print (df)
   ID   c1   c2  c3   c4 result
0   1    a    b   a  NaN      a
1   2  NaN   cc  dd   cc     cc
2   3  NaN   ee  ff   ee     ee
3   4  NaN  NaN  gg   gg     gg
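
To make the back-fill idea easy to try, here is a minimal self-contained sketch. The DataFrame is an assumption modelled on the table in the question, with one extra all-null row (ID 5) added to show the 'unknown' fallback; value_cols is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question's table (assumed layout);
# the last row is entirely null to exercise the fallback.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", np.nan, np.nan, np.nan],
    "c3": [np.nan, "dd", "ee", np.nan, np.nan],
    "c4": [np.nan, np.nan, "ff", "gg", np.nan],
})

value_cols = ["c1", "c2", "c3", "c4"]

# Back-fill along each row: every cell takes the next non-null value to its right,
# so the first column ends up holding the first non-null value of the row.
filled = df[value_cols].bfill(axis=1)

# Rows with no value at all are still NaN in the first column; label them.
df["result"] = filled.iloc[:, 0].fillna("unknown")

print(df)
```

Listing the columns explicitly, as here, avoids depending on column position, which seems safer when the real column names follow no pattern.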

Performance:

df = pd.concat([df] * 1000, ignore_index=True)


In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop

In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop

# jpp's solution
In [222]: %%timeit
     ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
     ...: 
     ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
     ...: 
1 loop, best of 3: 180 ms per loop

# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop

Answer 2

Setup

df = df.set_index('ID') # if necessary
df
     c1   c2  c3   c4
ID                   
1     a    b   a  NaN
2   NaN   cc  dd   cc
3   NaN   ee  ff   ee
4   NaN  NaN  gg   gg

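Solution
stack + groupby + first
stack implicitly drops NaNs, so groupby.first is guaranteed to give you the first non-null value of each row, whenever one exists. Assigning the result back leaves NaN at any index that had no value at all, which you can fill with a subsequent fillna call.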

df['result'] = df.stack().groupby(level=0).first()
# df['result'] = df['result'].fillna('unknown') # if necessary 
df
     c1   c2  c3   c4 result
ID                          
1     a    b   a  NaN      a
2   NaN   cc  dd   cc     cc
3   NaN   ee  ff   ee     ee
4   NaN  NaN  gg   gg     gg

(Beware, this is slow for larger dataframes; for performance you may want to use @jezrael's solution.)
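
As a rough, self-contained sketch of the stack + groupby + first idea: the data below is an assumption modelled on the question's table, with an extra all-null row (ID 5) to show where fillna comes in. The name first_non_null is illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed sample data, indexed by ID as in the setup step.
df = pd.DataFrame(
    {
        "c1": ["a", np.nan, np.nan, np.nan, np.nan],
        "c2": ["b", "cc", np.nan, np.nan, np.nan],
        "c3": [np.nan, "dd", "ee", np.nan, np.nan],
        "c4": [np.nan, np.nan, "ff", "gg", np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

# Reshape to one long Series keyed by (ID, column), then take the first
# non-null entry per ID. groupby.first skips nulls on its own.
first_non_null = df.stack().groupby(level=0).first()

# Assignment aligns on the ID index; rows without any value end up null.
df["result"] = first_non_null
df["result"] = df["result"].fillna("unknown")

print(df)
```

Because groupby.first skips nulls by itself, the sketch should not depend on whether stack drops them first.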

Answer 3

I am using lookup, with the data from Jpp:

df=df.set_index('ID')
s=df.ne('').idxmax(1)
df['Result']=df.lookup(s.index,s)
df
Out[492]: 
   c1  c2  c3  c4 Result
ID                      
1   a   b              a
2      cc  dd         cc
3          ee  ff     ee
4              gg     gg
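
DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the snippet above only runs on older versions. A possible substitute, sketched here under the assumption that empty cells are empty strings as in this answer's data, is to find the position of the first non-empty cell with NumPy and index the underlying array directly. The names not_empty and first_pos are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed sample data: blanks are empty strings rather than NaN.
df = pd.DataFrame(
    {
        "c1": ["a", "", "", ""],
        "c2": ["b", "cc", "", ""],
        "c3": ["", "dd", "ee", ""],
        "c4": ["", "", "ff", "gg"],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

# Boolean mask of non-empty cells; argmax returns the position of the
# first True in each row.
not_empty = df.ne("").to_numpy()
first_pos = not_empty.argmax(axis=1)

# Pick one cell per row by (row position, column position).
values = df.to_numpy()[np.arange(len(df)), first_pos]
df["Result"] = values

print(df)
```

One caveat: for a row that is entirely empty, argmax returns 0, so the result would be the empty string from the first column. Such rows would need separate handling, for example with not_empty.any(axis=1).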

Answer 4

One way is to use pd.DataFrame.lookup and apply pd.Series.first_valid_index to the transposed dataframe:

# Label of the first non-null column in each row (the ID column is skipped);
# this is the jpp step timed in Answer 1.
cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)

# Fetch the value at each (row label, column label) pair.
df['result'] = df.lookup(df.index, cols)
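
For reference, the first_valid_index idea described in this answer can also be sketched without lookup, which no longer exists in pandas 2.0 and later. The data below is an assumption modelled on the question's table, with one all-null row added (ID 5); the 'unknown' fallback follows the question.

```python
import numpy as np
import pandas as pd

# Assumed sample data with a default integer index.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", np.nan, np.nan, np.nan],
    "c3": [np.nan, "dd", "ee", np.nan, np.nan],
    "c4": [np.nan, np.nan, "ff", "gg", np.nan],
})

# For each row, the label of the first non-null column (ID is skipped).
# Rows with no valid value yield a missing label.
cols = df.iloc[:, 1:].apply(pd.Series.first_valid_index, axis=1)

# Read the cell at each (row label, column label) pair, falling back to
# 'unknown' where no column qualified.
df["result"] = [
    df.at[row, col] if pd.notna(col) else "unknown"
    for row, col in cols.items()
]

print(df)
```

This calls a Python function once per row, so, in line with the timings in Answer 1, it is likely to be much slower than the bfill approach on large frames.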

Get the first non-null value per row
