详细

Question

这是当前的数据帧：

> ID        Date    current
> 2001980   10/30/2017  1   
> 2001980   10/29/2017  0   
> 2001980   10/28/2017  0   
> 2001980   10/27/2017  40  
> 2001980   10/26/2017  39  
> 2001980   10/25/2017  0   
> 2001980   10/24/2017  0   
> 2001980   10/23/2017  60  
> 2001980   10/22/2017  0   
> 2001980   10/21/2017  0   
> 2002222   10/21/2017  0   
> 2002222   10/20/2017  0   
> 2002222   10/19/2017  16  
> 2002222   10/18/2017  0   
> 2002222   10/17/2017  0   
> 2002222   10/16/2017  20  
> 2002222   10/15/2017  19  
> 2002222   10/14/2017  18

以下是最终数据框。列expected就是我想要的。

一个ID可能有多个日期/记录/行。（ID +日期）是唯一的。
此行的预期值=最后一行的预期值 - 1
最小值为0.
基于2中的公式，如果该行的预期值<1。此行的当前值，然后使用此行的当前值。例如，在2017年10月23日的ID 2001980。根据规则2，该值应为36，但根据规则4,36 <60，因此我们使用60。

非常感谢你。

> ID        Date    current expected 
> 2001980   10/30/2017  1   1 
> 2001980   10/29/2017  0   0
> 2001980   10/28/2017  0   0 
> 2001980   10/27/2017  40  40
> 2001980   10/26/2017  39  39 
> 2001980   10/25/2017  0   38
> 2001980   10/24/2017  0   37 
> 2001980   10/23/2017  60  60
> 2001980   10/22/2017  0   59 
> 2001980   10/21/2017  0   58
> 2002222   10/21/2017  0   0
> 2002222   10/20/2017  0   0 
> 2002222   10/19/2017  16  16
> 2002222   10/18/2017  0   15 
> 2002222   10/17/2017  0   14
> 2002222   10/16/2017  20  20
> 2002222   10/15/2017  19  19
> 2002222   10/14/2017  18  18

我使用的Excel使用以下公式：

= if（此行的ID =最后一行的ID， max（最后一行的预期值 - 1，此行的当前值），这一行的当前值）

Answer 1

修改更简单：

df['expected'] = df.groupby(['ID',df.current.ne(0).cumsum()])['current']\
  .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0,np.inf)

让我们玩得开心：

df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)

输出：

         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

详细

基本上，创建系列，s1和减去系列s2然后剪切负值并用零填充nan。

#Let's calculate two series first a series to fill the zeros in an 'ID' with the previous non-zero value 
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1

输出：

0      1.0
1      1.0
2      1.0
3     40.0
4     39.0
5     39.0
6     39.0
7     60.0
8     60.0
9     60.0
10     NaN
11     NaN
12    16.0
13    16.0
14    16.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

#Second series is a cumulative count of zeroes in a group by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2

输出：

0     0
1    -1
2    -2
3     0
4     0
5    -1
6    -2
7     0
8    -1
9    -2
10   -1
11   -2
12    0
13   -1
14   -2
15    0
16    0
17    0
Name: current, dtype: int32

将剪辑和fillna添加到一起。

(s1 + s2).clip(0, np.inf).fillna(0)

输出：

0      1.0
1      0.0
2      0.0
3     40.0
4     39.0
5     38.0
6     37.0
7     60.0
8     59.0
9     58.0
10     0.0
11     0.0
12    16.0
13    15.0
14    14.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

Answer 2

因此，您可以使用apply和nested functions

执行此操作

import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]

df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})

然后创建更新框架的功能

Python 3.X

def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic

Python 2.X

def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic

运行如下功能

df['expected'] = df.apply(update_frame(df), axis=1)

输出符合预期

Answer 3

编辑：解决OP关于扩展到数百万行的问题。

是的，我的原始答案不会扩展到非常大的数据帧。但是，通过少量编辑，这种易于阅读的解决方案将进行扩展。随后的优化利用了Numba中的JIT编译器。在导入Numba之后，我添加了jit装饰器并修改了函数以在numpy数组上操作而不是pandas对象。 Numba意识到numpy，无法优化pandas对象。

import numba

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)

要将numpy数组传递给函数，请使用pandas系列的.values属性。

df1['expected'] = expected(df1.ID.values, df1.current.values)

为了测试性能，我将原始数据帧扩展到超过100万行。

df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)

新的变化表现非常好。

%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1.shape
Out[65]: (1179648, 4)

df1.tail(15)
Out[66]: 
              ID        Date  current  expected
1179633  2001980  10/27/2017       40        40
1179634  2001980  10/26/2017       39        39
1179635  2001980  10/25/2017        0        38
1179636  2001980  10/24/2017        0        37
1179637  2001980  10/23/2017       60        60
1179638  2001980  10/22/2017        0        59
1179639  2001980  10/21/2017        0        58
1179640  2002222  10/21/2017        0         0
1179641  2002222  10/20/2017        0         0
1179642  2002222  10/19/2017       16        16
1179643  2002222  10/18/2017        0        15
1179644  2002222  10/17/2017        0        14
1179645  2002222  10/16/2017       20        20
1179646  2002222  10/15/2017       19        19
1179647  2002222  10/14/2017       18        18

原始答案

有点蛮力，但很容易理解。

    def expected(df):
        lexp = []
        lstID = None
        expected = 0
        for i in range(len(df)):
            id, current = df[['ID', 'current']].iloc[i]
            if id == lstID:
                expected = max(expected - 1, 0)
                expected = max(current, expected)
            else:
                expected = current
            lexp.append(expected)
            lstID = id
        return pd.Series(lexp)

输出

df['expected'] = expected(df)

df
Out[53]: 
         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

Answer 4

我相信@Tarun Lalwani指出了一个正确的方向。那就是在DataFrame之外保存一些关键信息。虽然代码可以简化，但只要正确管理名称，使用全局变量没有任何问题。它是一种设计模式，通常可以使事情变得更简单并提高可读性。

cached_last = { 'expected': None, 'ID': None }

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected =  max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)

从pandas.DataFrame.apply的文档中，请注意apply函数的潜在副作用。

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.

Answer 5

这里的逻辑应该是工作

lst=[]

for _, y in df.groupby('ID'):
    z=[]
    for i,(_, x) in enumerate(y.iterrows()):
        print(x)
        if x['current'] > 0:
           z.append(x['current'])
        else:
            try:
               z.append(max(z[i-1]-1,0))
            except:
               z.append(0)

    lst.extend(z)


lst

Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]

python - 使用上一行的值来更新新行值

5 个答案:

详细

基本上，创建系列，s1和减去系列s2然后剪切负值并用零填充nan。

将剪辑和fillna添加到一起。