python - 使用上一行的值来更新新行值

时间:2018-03-02 17:13:22

标签: python pandas dataframe

这是当前的数据帧:

> ID        Date    current
> 2001980   10/30/2017  1   
> 2001980   10/29/2017  0   
> 2001980   10/28/2017  0   
> 2001980   10/27/2017  40  
> 2001980   10/26/2017  39  
> 2001980   10/25/2017  0   
> 2001980   10/24/2017  0   
> 2001980   10/23/2017  60  
> 2001980   10/22/2017  0   
> 2001980   10/21/2017  0   
> 2002222   10/21/2017  0   
> 2002222   10/20/2017  0   
> 2002222   10/19/2017  16  
> 2002222   10/18/2017  0   
> 2002222   10/17/2017  0   
> 2002222   10/16/2017  20  
> 2002222   10/15/2017  19  
> 2002222   10/14/2017  18  

以下是最终数据框。列expected就是我想要的。

  1. 一个ID可能有多个日期/记录/行。 (ID +日期)是唯一的。
  2. 此行的预期值=最后一行的预期值 - 1
  3. 最小值为0.
  4. 基于2中的公式,如果该行的预期值<1。此行的当前值,然后使用此行的当前值。例如,在2017年10月23日的ID 2001980。根据规则2,该值应为36,但根据规则4,36 <60,因此我们使用60。
  5. 非常感谢你。

    > ID        Date    current expected 
    > 2001980   10/30/2017  1   1 
    > 2001980   10/29/2017  0   0
    > 2001980   10/28/2017  0   0 
    > 2001980   10/27/2017  40  40
    > 2001980   10/26/2017  39  39 
    > 2001980   10/25/2017  0   38
    > 2001980   10/24/2017  0   37 
    > 2001980   10/23/2017  60  60
    > 2001980   10/22/2017  0   59 
    > 2001980   10/21/2017  0   58
    > 2002222   10/21/2017  0   0
    > 2002222   10/20/2017  0   0 
    > 2002222   10/19/2017  16  16
    > 2002222   10/18/2017  0   15 
    > 2002222   10/17/2017  0   14
    > 2002222   10/16/2017  20  20
    > 2002222   10/15/2017  19  19
    > 2002222   10/14/2017  18  18
    

    我使用的Excel使用以下公式:

    = if(此行的ID =最后一行的ID,     max(最后一行的预期值 - 1,此行的当前值),     这一行的当前值)

5 个答案:

答案 0 :(得分:4)

修改更简单:

df['expected'] = df.groupby(['ID',df.current.ne(0).cumsum()])['current']\
  .transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0,np.inf)

让我们玩得开心:

df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)

输出:

         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

详细

基本上,创建系列,s1和减去系列s2然后剪切负值并用零填充nan。

#Let's calculate two series first a series to fill the zeros in an 'ID' with the previous non-zero value 
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1

输出:

0      1.0
1      1.0
2      1.0
3     40.0
4     39.0
5     39.0
6     39.0
7     60.0
8     60.0
9     60.0
10     NaN
11     NaN
12    16.0
13    16.0
14    16.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

#Second series is a cumulative count of zeroes in a group by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2

输出:

0     0
1    -1
2    -2
3     0
4     0
5    -1
6    -2
7     0
8    -1
9    -2
10   -1
11   -2
12    0
13   -1
14   -2
15    0
16    0
17    0
Name: current, dtype: int32

将剪辑和fillna添加到一起。

(s1 + s2).clip(0, np.inf).fillna(0)

输出:

0      1.0
1      0.0
2      0.0
3     40.0
4     39.0
5     38.0
6     37.0
7     60.0
8     59.0
9     58.0
10     0.0
11     0.0
12    16.0
13    15.0
14    14.0
15    20.0
16    19.0
17    18.0
Name: current, dtype: float64

答案 1 :(得分:3)

因此,您可以使用applynested functions

执行此操作
import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]

df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})

然后创建更新框架的功能

Python 3.X

def update_frame(df):
    last_expected = None
    def apply_logic(row):
        nonlocal last_expected
        last_row_id = row.name - 1
        if row.name == 0:
            last_expected = row["current"]
            return last_expected
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return last_expected
    return apply_logic

Python 2.X

def update_frame(df):
    sd = {"last_expected": None}
    def apply_logic(row):
        last_row_id = row.name - 1
        if row.name == 0:
            sd['last_expected'] = row["current"]
            return sd['last_expected']
        last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
        sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
        return sd['last_expected']
    return apply_logic

运行如下功能

df['expected'] = df.apply(update_frame(df), axis=1)

输出符合预期

Output

答案 2 :(得分:0)

编辑:解决OP关于扩展到数百万行的问题。

是的,我的原始答案不会扩展到非常大的数据帧。但是,通过少量编辑,这种易于阅读的解决方案将进行扩展。随后的优化利用了Numba中的JIT编译器。在导入Numba之后,我添加了jit装饰器并修改了函数以在numpy数组上操作而不是pandas对象。 Numba意识到numpy,无法优化pandas对象。

import numba

@numba.jit
def expected(id_col, current_col):
    lexp = []
    lstID = 0
    expected = 0
    for i in range(len(id_col)):
        id, current = id_col[i], current_col[i]
        if id == lstID:
            expected = max(current, max(expected - 1, 0))
        else:
            expected = current
        lexp.append(expected)
        lstID = id
    return np.array(lexp)

要将numpy数组传递给函数,请使用pandas系列的.values属性。

df1['expected'] = expected(df1.ID.values, df1.current.values)

为了测试性能,我将原始数据帧扩展到超过100万行。

df1 = df
while len(df1) < 1000000:
    df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)

新的变化表现非常好。

%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1.shape
Out[65]: (1179648, 4)

df1.tail(15)
Out[66]: 
              ID        Date  current  expected
1179633  2001980  10/27/2017       40        40
1179634  2001980  10/26/2017       39        39
1179635  2001980  10/25/2017        0        38
1179636  2001980  10/24/2017        0        37
1179637  2001980  10/23/2017       60        60
1179638  2001980  10/22/2017        0        59
1179639  2001980  10/21/2017        0        58
1179640  2002222  10/21/2017        0         0
1179641  2002222  10/20/2017        0         0
1179642  2002222  10/19/2017       16        16
1179643  2002222  10/18/2017        0        15
1179644  2002222  10/17/2017        0        14
1179645  2002222  10/16/2017       20        20
1179646  2002222  10/15/2017       19        19
1179647  2002222  10/14/2017       18        18

原始答案

有点蛮力,但很容易理解。

    def expected(df):
        lexp = []
        lstID = None
        expected = 0
        for i in range(len(df)):
            id, current = df[['ID', 'current']].iloc[i]
            if id == lstID:
                expected = max(expected - 1, 0)
                expected = max(current, expected)
            else:
                expected = current
            lexp.append(expected)
            lstID = id
        return pd.Series(lexp)

输出

df['expected'] = expected(df)

df
Out[53]: 
         ID        Date  current  expected
0   2001980  10/30/2017        1         1
1   2001980  10/29/2017        0         0
2   2001980  10/28/2017        0         0
3   2001980  10/27/2017       40        40
4   2001980  10/26/2017       39        39
5   2001980  10/25/2017        0        38
6   2001980  10/24/2017        0        37
7   2001980  10/23/2017       60        60
8   2001980  10/22/2017        0        59
9   2001980  10/21/2017        0        58
10  2002222  10/21/2017        0         0
11  2002222  10/20/2017        0         0
12  2002222  10/19/2017       16        16
13  2002222  10/18/2017        0        15
14  2002222  10/17/2017        0        14
15  2002222  10/16/2017       20        20
16  2002222  10/15/2017       19        19
17  2002222  10/14/2017       18        18

答案 3 :(得分:0)

我相信@Tarun Lalwani指出了一个正确的方向。那就是在DataFrame之外保存一些关键信息。虽然代码可以简化,但只要正确管理名称,使用全局变量没有任何问题。它是一种设计模式,通常可以使事情变得更简单并提高可读性。

cached_last = { 'expected': None, 'ID': None }

def set_expected(x):
    if cached_last['ID'] is None or x.ID != cached_last['ID']:
        expected = x.current
    else:
        expected =  max(cached_last['expected'] - 1, x.current)
    cached_last['ID'] = x.ID
    cached_last['expected'] = expected
    return expected

df['expected'] = df.apply(set_expected, axis=1)

从pandas.DataFrame.apply的文档中,请注意apply函数的潜在副作用。

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.

答案 4 :(得分:0)

这里的逻辑应该是工作

lst=[]

for _, y in df.groupby('ID'):
    z=[]
    for i,(_, x) in enumerate(y.iterrows()):
        print(x)
        if x['current'] > 0:
           z.append(x['current'])
        else:
            try:
               z.append(max(z[i-1]-1,0))
            except:
               z.append(0)

    lst.extend(z)


lst

Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]