Question

In the following DataFrame, I would like to find the last value in each row, as well as the name of the column that contains it.

DF

 ID   c1   c4   c3    c2   c8   c7
 1     1    2    2    1    NaN  NaN
 2     1    2    1    NaN  NaN  NaN
 3     1    1    NaN  NaN  2     1

Expected output

 ID     Colname    lastValue
 1        c2         1
 2        c3         1
 3        c7         1

My code only finds the last value:

df['lastValue'] = df.ffill(axis = 1).iloc[:, -1]

How can I find the column name?

Thanks!

Answer 1

last_valid_index + lookup

s=df.apply(pd.Series.last_valid_index, 1)
df['Last']=df.lookup(s.index,s)
df['Col']=s
df
Out[49]: 
   ID  c1  c4   c3   c2   c8   c7  Last Col
0   1   1   2  2.0  1.0  NaN  NaN   1.0  c2
1   2   1   2  1.0  NaN  NaN  NaN   1.0  c3
2   3   1   1  NaN  NaN  2.0  1.0   1.0  c7
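Note that `DataFrame.lookup` was deprecated in pandas 1.2 and removed in pandas 2.0, so the snippet above fails on current versions. Below is a minimal sketch of the same idea without `lookup`. It assumes the question's sample frame (rebuilt here by hand) and swaps in `Index.get_indexer` plus NumPy integer indexing; this is one possible replacement, not the answer author's method.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'c1': [1, 1, 1],
    'c4': [2, 2, 1],
    'c3': [2, 1, np.nan],
    'c2': [1, np.nan, np.nan],
    'c8': [np.nan, np.nan, 2],
    'c7': [np.nan, np.nan, 1],
})

# Label of the last non-null column in each row.
s = df.apply(pd.Series.last_valid_index, axis=1)

# Translate column labels to integer positions, then pick one cell per row.
pos = df.columns.get_indexer(s)
out = df.assign(Last=df.to_numpy()[np.arange(len(df)), pos], Col=s)
print(out)
```

A row that is entirely NaN would make `last_valid_index` return `None`, which `get_indexer` maps to -1 (the last column). That cannot happen here because `ID` is always populated, but it is worth guarding against on other data.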

Answer 2

Take the argmax of the cumulative sum of the notnull mask along axis 1 (across columns):

i = np.argmax(df.notnull().cumsum(1), axis=1)

Or,

i = (~np.isnan(df.values)).cumsum(1).argmax(1)  # pure numpy

Now,

df.columns[i]
Index(['c2', 'c3', 'c7'], dtype='object')

And,

df.values[np.arange(len(df)), i]
array([1., 1., 1.])

Putting it together,

pd.DataFrame({
     'ID' : df.ID, 
     'Colname' : df.columns[i], 
     'lastValue' : df.values[np.arange(len(df)), i]
})

   ID Colname  lastValue
0   1      c2        1.0
1   2      c3        1.0
2   3      c7        1.0
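To see why the argmax of the cumulative sum lands on the last non-null position, it may help to trace a single row. The sketch below uses the third row of the question's data (including its `ID` value) as a standalone array. The variable names are only illustrative.

```python
import numpy as np

# Third row of the sample data: ID, c1, c4, c3, c2, c8, c7.
row = np.array([3.0, 1.0, 1.0, np.nan, np.nan, 2.0, 1.0])

mask = ~np.isnan(row)        # True where a value is present
running = mask.cumsum()      # running count of non-null values seen so far

# The count only increases at non-null positions, so its maximum is first
# reached at the last non-null position. argmax returns that first occurrence.
last_pos = running.argmax()

print(running)   # running counts per position
print(last_pos)  # position of the last non-null value
```

Because the running count stays flat across NaNs, gaps in the middle of a row (as in this one) do not affect the result.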

Answer 3

This can be done with numpy. The argmax algorithm is courtesy of @piRSquared.

A = df.values

idx = A.shape[1] - (~np.isnan(A))[:, ::-1].argmax(1) - 1
cols = df.columns[idx]

res = pd.DataFrame({'ID': df['ID'], 'col': cols,
                    'last': A[range(A.shape[0]), idx]})

#    ID col  last
# 0   1  c2   1.0
# 1   2  c3   1.0
# 2   3  c7   1.0

Performance benchmarking

import random
import numpy as np
import pandas as pd

%timeit cs(df)   # 10 loops, best of 3: 63.5 ms per loop
%timeit jp(df)   # 100 loops, best of 3: 2.76 ms per loop
%timeit wen(df)  # 10 loops, best of 3: 346 ms per loop

# create dataframe with randomised np.nan

df = pd.DataFrame(np.random.randint(0, 9, (1000, 1000)), dtype=float)
df = df.rename(columns={0: 'ID'})
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row, col] = np.nan

def jp(df):
    A = df.values

    idx = A.shape[1] - (~np.isnan(A))[:, ::-1].argmax(1) - 1
    cols = df.columns[idx]

    res = pd.DataFrame({'ID': df['ID'], 'col': cols,
                        'last': A[range(A.shape[0]), idx]})

    return res

def wen(df):

    s=df.apply(pd.Series.last_valid_index, 1)
    df['Last']=df.lookup(s.index,s)
    df['Col']=s

    return df

def cs(df):
    i = (~np.isnan(df.values)).cumsum(1).argmax(1)  # pure numpy

    df = pd.DataFrame({
         'ID' : df.ID, 
         'Colname' : df.columns[i], 
         'lastValue' : df.values[np.arange(len(df)), i]
    })
    return df
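`%timeit` is only available in IPython/Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. The function names `last_idx_reversed` and `last_idx_cumsum`, the smaller 200x200 frame, and the seeded generator are assumptions made for illustration. Timings will differ from the figures quoted above and depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller random frame with roughly 10% NaN; column 0 is kept fully populated as 'ID'.
rng = np.random.default_rng(0)
A = rng.integers(0, 9, size=(200, 200)).astype(float)
A[rng.random(A.shape) < 0.1] = np.nan
A[:, 0] = np.arange(len(A))
df = pd.DataFrame(A).rename(columns={0: 'ID'})

def last_idx_reversed(df):
    # Reverse the mask so argmax finds the first non-null from the right.
    A = df.to_numpy()
    return A.shape[1] - (~np.isnan(A))[:, ::-1].argmax(1) - 1

def last_idx_cumsum(df):
    # Argmax of the running non-null count is the last non-null position.
    return (~np.isnan(df.to_numpy())).cumsum(1).argmax(1)

# Both approaches should agree before their speed is compared.
assert np.array_equal(last_idx_reversed(df), last_idx_cumsum(df))

for fn in (last_idx_reversed, last_idx_cumsum):
    seconds = timeit.timeit(lambda: fn(df), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```

Checking that the candidates return identical indices first avoids comparing the speed of functions that compute different things.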

Find the last value and the corresponding column name in a DataFrame
