Question

I have a DataFrame:

Columns = "Chronological Months"
Index = "Customer ID's"
Data = "Dollars Spent By Customer"

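I want to create a new column that indicates the number of consecutive months each customer has been inactive (that is, has spent $0, counting back from the most recent month). I am only interested in the last 6 months.

I can think of some really inefficient ways to do this (e.g., a string of IF statements applied to a vector), but I would like to avoid them.

What I have in mind looks like the image below.

Any help would be greatly appreciated. Thanks!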

Answer 1

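Use bfill with axis=1 (along the columns) to back-fill each row, then use isnull and sum(axis=1) to count the NaNs that remain. After the back-fill, the only NaNs left in a row are those that come after the customer's last purchase.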

In [14]: df.bfill(axis=1).isnull().sum(axis=1)
Out[14]:
Customer 1     5
Customer 2     6
Customer 3     1
Customer 4     5
Customer 5     0
Customer 6     3
Customer 7     6
Customer 8     2
Customer 9     3
Customer 10    0
dtype: int64

In [15]: df['Months of Inactivity'] = df.bfill(axis=1).isnull().sum(axis=1)

In [16]: df
Out[16]:
               Jan    Feb    Mar  April    May   June  Months of Inactivity
Customer 1   300.0    NaN    NaN    NaN    NaN    NaN                     5
Customer 2     NaN    NaN    NaN    NaN    NaN    NaN                     6
Customer 3     NaN  100.0    NaN    NaN  100.0    NaN                     1
Customer 4   300.0    NaN    NaN    NaN    NaN    NaN                     5
Customer 5     NaN    NaN    NaN    NaN    NaN  300.0                     0
Customer 6     NaN    NaN  200.0    NaN    NaN    NaN                     3
Customer 7     NaN    NaN    NaN    NaN    NaN    NaN                     6
Customer 8   100.0    NaN    NaN  100.0    NaN    NaN                     2
Customer 9     NaN    NaN  400.0    NaN    NaN    NaN                     3
Customer 10  300.0    NaN    NaN  200.0  100.0  100.0                     0
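
For anyone who wants to reproduce this without the original data, here is a minimal self-contained sketch. The way the frame is built and the names rows, months, recent and inactive are my own assumptions for illustration; the iloc[:, -6:] slice is one way to honour the "last 6 months only" requirement when the real frame has more columns, assuming the columns are in chronological order.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the table above (NaN = no spending that month).
months = ["Jan", "Feb", "Mar", "April", "May", "June"]
rows = {
    "Customer 1": [300, np.nan, np.nan, np.nan, np.nan, np.nan],
    "Customer 2": [np.nan] * 6,
    "Customer 3": [np.nan, 100, np.nan, np.nan, 100, np.nan],
    "Customer 4": [300, np.nan, np.nan, np.nan, np.nan, np.nan],
    "Customer 5": [np.nan, np.nan, np.nan, np.nan, np.nan, 300],
    "Customer 6": [np.nan, np.nan, 200, np.nan, np.nan, np.nan],
    "Customer 7": [np.nan] * 6,
    "Customer 8": [100, np.nan, np.nan, 100, np.nan, np.nan],
    "Customer 9": [np.nan, np.nan, 400, np.nan, np.nan, np.nan],
    "Customer 10": [300, np.nan, np.nan, 200, 100, 100],
}
df = pd.DataFrame.from_dict(rows, orient="index", columns=months)

# Keep only the six most recent months (columns assumed chronological).
recent = df.iloc[:, -6:]

# Back-fill along each row: a NaN survives only if no later month has a value,
# so counting the surviving NaNs gives the trailing run of inactive months.
inactive = recent.bfill(axis=1).isnull().sum(axis=1)
print(inactive.tolist())  # [5, 6, 1, 5, 0, 3, 6, 2, 3, 0]
```

Computing the result from a slice first and assigning it to a new column afterwards keeps the new column out of the calculation if the line is run a second time.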

If the blank cells contain a hyphen (-) instead of NaN, use replace first:

In [31]: df
Out[31]:
             Jan  Feb  Mar April  May June
Customer 1   300    -    -     -    -    -
Customer 2     -    -    -     -    -    -
Customer 3     -  100    -     -  100    -
Customer 4   300    -    -     -    -    -
Customer 5     -    -    -     -    -  300
Customer 6     -    -  200     -    -    -
Customer 7     -    -    -     -    -    -
Customer 8   100    -    -   100    -    -
Customer 9     -    -  400     -    -    -
Customer 10  300    -    -   200  100  100

In [32]: df['Inactivity'] = df.replace('-', np.nan).bfill(axis=1).isnull().sum(axis=1)

In [33]: df
Out[33]:
             Jan  Feb  Mar April  May June  Inactivity
Customer 1   300    -    -     -    -    -           5
Customer 2     -    -    -     -    -    -           6
Customer 3     -  100    -     -  100    -           1
Customer 4   300    -    -     -    -    -           5
Customer 5     -    -    -     -    -  300           0
Customer 6     -    -  200     -    -    -           3
Customer 7     -    -    -     -    -    -           6
Customer 8   100    -    -   100    -    -           2
Customer 9     -    -  400     -    -    -           3
Customer 10  300    -    -   200  100  100           0
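
The question talks about customers who spent $0, so the real data may hold zeros or other placeholders rather than a single "-" marker. Below is a hedged sketch of a more general cleanup, assuming a small made-up frame (the index labels A, B, C and the names numeric and spent are hypothetical): pd.to_numeric with errors="coerce" turns anything non-numeric into NaN, and where masks non-positive amounts.

```python
import pandas as pd

# Hypothetical frame: missing months stored as "-" strings, plus one explicit 0.
df = pd.DataFrame(
    {"April": ["100", "-", "-"], "May": ["-", "-", "0"], "June": ["-", "300", "-"]},
    index=["A", "B", "C"],
)

# Coerce each column to numbers; anything non-numeric (here "-") becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Treat zero (or negative) spending as inactive as well by masking it to NaN.
spent = numeric.where(numeric > 0)

# Same counting step as before: NaNs that survive the back-fill are trailing.
inactive = spent.bfill(axis=1).isnull().sum(axis=1)
print(inactive.tolist())  # [2, 0, 3]
```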

Answer 2

Alternatively, you can try last_valid_index:

d['Months of Inactivity']=6-d.apply(pd.Series.last_valid_index, axis=1).map(dict(zip(list(d), list(range(1,d.shape[1]+1))))).fillna(0)
d
Out[221]: 
              Jan    Feb    Mar  April    May   June  Months of Inactivity
Customer1   300.0    NaN    NaN    NaN    NaN    NaN                   5.0
Customer2     NaN    NaN    NaN    NaN    NaN    NaN                   6.0
Customer3     NaN  100.0    NaN    NaN  100.0    NaN                   1.0
Customer4   300.0    NaN    NaN    NaN    NaN    NaN                   5.0
Customer5     NaN    NaN    NaN    NaN    NaN  300.0                   0.0
Customer6     NaN    NaN  200.0    NaN    NaN    NaN                   3.0
Customer7     NaN    NaN    NaN    NaN    NaN    NaN                   6.0
Customer8   100.0    NaN    NaN  100.0    NaN    NaN                   2.0
Customer9     NaN    NaN  400.0    NaN    NaN    NaN                   3.0
Customer10  300.0    NaN    NaN  200.0  100.0  100.0                   0.0
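
The one-liner above packs several steps together and hard-codes the 6. Here is a step-by-step sketch of the same idea on a small made-up frame (the data and the names last_label, position and last_pos are assumptions for illustration); it uses the frame's width instead of a literal and casts the result to integers.

```python
import numpy as np
import pandas as pd

# Hypothetical three-month frame; row B has no purchases at all.
d = pd.DataFrame(
    {
        "April": [100, np.nan, np.nan],
        "May": [np.nan, np.nan, 50],
        "June": [np.nan, np.nan, np.nan],
    },
    index=["A", "B", "C"],
)

# Label of the last non-NaN column in each row (None when the row is all NaN).
last_label = d.apply(pd.Series.last_valid_index, axis=1)

# Map each column label to its 1-based position; None has no entry and
# becomes NaN, which fillna(0) turns into "position 0" (never active).
position = {label: i for i, label in enumerate(d.columns, start=1)}
last_pos = last_label.map(position).fillna(0).astype(int)

# Months elapsed after the last purchase.
inactive = d.shape[1] - last_pos
print(inactive.tolist())  # [2, 3, 1]
```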

Answer 3

If speed is critical, you can drop down to NumPy and get a speedup of nearly two orders of magnitude.

a=np.where(df.values != '-', 1, 0)
np.append(a[:, ::-1], np.ones((len(a),1)), axis=1).argmax(axis=1)

array([5, 6, 1, 5, 0, 3, 6, 2, 3, 0])
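
The trick is to reverse the columns so that the most recent month comes first, then let argmax find the first active month in each row. The appended column of ones acts as a sentinel so that rows with no activity return the full width. Below is a sketch of the same idea for NaN-coded data, with an explicit patch in place of the sentinel column; the array values and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical spending matrix: rows are customers, columns are months in order.
values = np.array([
    [300.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 100.0, 50.0],
])

active = ~np.isnan(values)         # True where the customer spent something
reversed_active = active[:, ::-1]  # most recent month first

# argmax returns the index of the first True per row,
# i.e. the number of months since the last purchase.
inactive = reversed_active.argmax(axis=1)

# argmax is 0 for an all-False row, so set never-active customers to the width.
inactive[~active.any(axis=1)] = values.shape[1]
print(inactive.tolist())  # [2, 3, 0]
```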

Speed test

%%timeit
a=np.where(df.values != '-', 1, 0)
np.append(a[:, ::-1], np.ones((len(a),1)), axis=1).argmax(axis=1)
24.4 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df.replace('-', np.nan).bfill(axis=1).isnull().sum(axis=1)
1.91 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
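
The timings above come from a 10-row frame inside IPython, so they will differ on other machines and data sizes. Here is a rough sketch for re-running the comparison with the standard-library timeit module on a larger, randomly generated frame; the frame size, the 70% missing rate and the function names with_pandas and with_numpy are my own assumptions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 1,000 customers, 6 months, roughly 70% missing.
rng = np.random.default_rng(0)
values = rng.integers(1, 500, size=(1_000, 6)).astype(float)
values[rng.random(values.shape) < 0.7] = np.nan
df = pd.DataFrame(values)

def with_pandas(frame):
    return frame.bfill(axis=1).isnull().sum(axis=1).to_numpy()

def with_numpy(frame):
    active = frame.notna().to_numpy()
    # A sentinel column of True guarantees a hit for all-NaN rows.
    sentinel = np.ones((len(active), 1), dtype=bool)
    padded = np.hstack([active[:, ::-1], sentinel])
    return padded.argmax(axis=1)

# Sanity check: both approaches must agree before the timings mean anything.
assert (with_pandas(df) == with_numpy(df)).all()

for fn in (with_pandas, with_numpy):
    seconds = timeit.timeit(lambda: fn(df), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} µs per call")
```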

Python + Pandas - Determining Months of Inactivity