Question

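I have a pandas DataFrame with 3 columns, each holding a site that a user visited during a session.

In some cases a user did not visit 3 sites in a single session. This is represented by 0, meaning no site was visited.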

import pandas as pd

df = pd.DataFrame(data=[[5, 8, 1],[8,0,0],[1,17,0]], 
                  columns=['site1', 'site2', 'site3'])
print(df)

   site1  site2  site3
0      5      8      1
1      8      0      0
2      1     17      0

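In the example above, user 0 visited sites 5, 8 and 1. User 1 visited only site 8, and user 2 visited sites 1 and 17.

I want to create a new column, last_site, that shows the last site the user visited in that session.

The result I want is: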

   site1  site2  site3  last_site
0      5      8      1          1
1      8      0      0          8
2      1     17      0         17

How can I do this concisely with pandas?

Answer 1

Replace the 0 values with missing values, forward-fill along each row, then select the last column with iloc:

import numpy as np

df['last'] = df.replace(0, np.nan).ffill(axis=1).iloc[:, -1].astype(int)
print (df)
   site1  site2  site3  last
0      5      8      1     1
1      8      0      0     8
2      1     17      0    17
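
One case the question does not cover is a row made up entirely of zeros: the forward fill leaves NaN there, and `astype(int)` then raises. The following is a minimal sketch of one way to guard against that, assuming 0 is an acceptable "no site" result; the extra all-zero row and the `site_cols` name are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same data as the question, plus a hypothetical session with no visits
df = pd.DataFrame([[5, 8, 1], [8, 0, 0], [1, 17, 0], [0, 0, 0]],
                  columns=['site1', 'site2', 'site3'])

site_cols = ['site1', 'site2', 'site3']  # illustrative name

df['last'] = (df[site_cols]
              .replace(0, np.nan)   # zeros become missing values
              .ffill(axis=1)        # carry the last visited site to the right
              .iloc[:, -1]          # the right-most column now holds the answer
              .fillna(0)            # assumption: all-zero rows map back to 0
              .astype(int))
print(df)
```

Selecting `site_cols` explicitly also keeps the result independent of any helper columns already added to `df`.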

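If performance matters, use numpy:

a = df.values
m = a != 0  # True where a site was visited

# Reverse the columns, find the first True per row, then map back to the original position
df['last'] = a[np.arange(m.shape[0]), m.shape[1]-m[:,::-1].argmax(1)-1]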
print (df)
   site1  site2  site3  last
0      5      8      1     1
1      8      0      0     8
2      1     17      0    17
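
The index arithmetic in the one-liner can be easier to follow when split into steps. This is only an unpacked sketch of the same expression on the question's data; the intermediate names `first_from_right`, `last_idx` and `last_vals` are made up for illustration.

```python
import numpy as np

a = np.array([[5, 8, 1],
              [8, 0, 0],
              [1, 17, 0]])
m = a != 0                                  # True where a site was visited

# argmax on the column-reversed mask gives the distance of the
# last non-zero entry from the right edge of each row
first_from_right = m[:, ::-1].argmax(axis=1)

# Convert that distance back into a normal column index
last_idx = m.shape[1] - 1 - first_from_right

# Pick one element per row using paired row/column indices
last_vals = a[np.arange(a.shape[0]), last_idx]
print(last_idx, last_vals)
```

For a row with no non-zero entries, `argmax` returns 0, so the last column (a 0) is selected.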

Answer 2

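Code:

# Series.nonzero() was removed in pandas 1.0; use the NumPy method on the underlying array instead
df['last_site'] = df.apply(lambda x: x.iloc[x.to_numpy().nonzero()[0]].iloc[-1], axis=1)

Output: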

   site1  site2  site3  last_site
0      5      8      1          1
1      8      0      0          8
2      1     17      0         17
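
A row-wise `apply` fails with an IndexError when a row has no non-zero entry, because `.iloc[-1]` is then taken on an empty Series. Below is a hedged variant using boolean indexing with a fallback; the helper name `last_nonzero`, its `default` parameter and the extra all-zero row are assumptions made for illustration.

```python
import pandas as pd

def last_nonzero(row, default=0):
    """Return the last non-zero value in a row, or `default` if there is none."""
    nz = row[row != 0]              # keep only the visited sites
    return nz.iloc[-1] if len(nz) else default

df = pd.DataFrame([[5, 8, 1], [8, 0, 0], [1, 17, 0], [0, 0, 0]],
                  columns=['site1', 'site2', 'site3'])

df['last_site'] = df.apply(last_nonzero, axis=1)
print(df)
```

This stays as slow as any row-wise `apply`, so it is better suited to small frames than to the performance-sensitive case.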

Answer 3

`mask` + `ffill`

A "pure pandas" solution:

df['last'] = df.mask(df.eq(0)).ffill(axis=1).iloc[:, -1].astype(int)

`numba`

For efficiency with a large number of rows/columns, numba can help. To see why it can do better than argmax, see "Efficiently return the index of the first value satisfying condition in array".

from numba import njit

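@njit
def get_last_val(A):
    m, n = A.shape
    # Default to the last column; copy so the input array is not overwritten
    res = A[:, -1].copy()
    for i in range(m):
        for j in range(n):
            # Assumes zeros only trail the visited sites: stop at the first 0
            if A[i, j] == 0:
                res[i] = A[i, max(0, j-1)]
                break
    return res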

df['last'] = get_last_val(df.values)
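
The loop above stops at the first zero, which matches the question's data, where zeros only appear after the visited sites. If a zero could appear between two visits, scanning from the right is one alternative. The sketch below is plain NumPy/Python with a hypothetical function name `last_nonzero_scan`; it is written so that it could presumably be decorated with `@njit` as well, but that has not been verified here.

```python
import numpy as np

def last_nonzero_scan(A):
    """For each row, return the right-most non-zero value (0 if there is none)."""
    m, n = A.shape
    res = np.zeros(m, dtype=A.dtype)
    for i in range(m):
        # Walk from the last column towards the first and stop at the first hit
        for j in range(n - 1, -1, -1):
            if A[i, j] != 0:
                res[i] = A[i, j]
                break
    return res

A = np.array([[5, 8, 1],
              [8, 0, 0],
              [1, 17, 0],
              [5, 0, 3],    # interior zero
              [0, 0, 0]])   # no visits at all
out = last_nonzero_scan(A)
print(out)
```

Because the result array is allocated fresh, the input is never modified.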

Pandas DataFrame: get the value of the last non-zero column
