这是当前的数据帧:
> ID Date current
> 2001980 10/30/2017 1
> 2001980 10/29/2017 0
> 2001980 10/28/2017 0
> 2001980 10/27/2017 40
> 2001980 10/26/2017 39
> 2001980 10/25/2017 0
> 2001980 10/24/2017 0
> 2001980 10/23/2017 60
> 2001980 10/22/2017 0
> 2001980 10/21/2017 0
> 2002222 10/21/2017 0
> 2002222 10/20/2017 0
> 2002222 10/19/2017 16
> 2002222 10/18/2017 0
> 2002222 10/17/2017 0
> 2002222 10/16/2017 20
> 2002222 10/15/2017 19
> 2002222 10/14/2017 18
以下是最终数据框。列expected
就是我想要的。
非常感谢你。
> ID Date current expected
> 2001980 10/30/2017 1 1
> 2001980 10/29/2017 0 0
> 2001980 10/28/2017 0 0
> 2001980 10/27/2017 40 40
> 2001980 10/26/2017 39 39
> 2001980 10/25/2017 0 38
> 2001980 10/24/2017 0 37
> 2001980 10/23/2017 60 60
> 2001980 10/22/2017 0 59
> 2001980 10/21/2017 0 58
> 2002222 10/21/2017 0 0
> 2002222 10/20/2017 0 0
> 2002222 10/19/2017 16 16
> 2002222 10/18/2017 0 15
> 2002222 10/17/2017 0 14
> 2002222 10/16/2017 20 20
> 2002222 10/15/2017 19 19
> 2002222 10/14/2017 18 18
我使用的Excel使用以下公式:
= if(此行的ID =最后一行的ID, max(最后一行的预期值 - 1,此行的当前值), 这一行的当前值)
答案 0 :(得分:4)
修改更简单:
df['expected'] = df.groupby(['ID',df.current.ne(0).cumsum()])['current']\
.transform(lambda x: x.eq(0).cumsum().mul(-1).add(x.iloc[0])).clip(0,np.inf)
让我们玩得开心:
df['expected'] = (df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill()) +
df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1))\
.clip(0,np.inf).fillna(0).astype(int)
print(df)
输出:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
#Let's calculate two series first a series to fill the zeros in an 'ID' with the previous non-zero value
s1 = df.groupby('ID')['current'].transform(lambda x: x.where(x.ne(0)).ffill())
s1
输出:
0 1.0
1 1.0
2 1.0
3 40.0
4 39.0
5 39.0
6 39.0
7 60.0
8 60.0
9 60.0
10 NaN
11 NaN
12 16.0
13 16.0
14 16.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
#Second series is a cumulative count of zeroes in a group by 'ID'
s2 = df.groupby(['ID',df.current.ne(0).cumsum()])['current'].transform(lambda x: x.eq(0).cumsum()).mul(-1)
s2
输出:
0 0
1 -1
2 -2
3 0
4 0
5 -1
6 -2
7 0
8 -1
9 -2
10 -1
11 -2
12 0
13 -1
14 -2
15 0
16 0
17 0
Name: current, dtype: int32
(s1 + s2).clip(0, np.inf).fillna(0)
输出:
0 1.0
1 0.0
2 0.0
3 40.0
4 39.0
5 38.0
6 37.0
7 60.0
8 59.0
9 58.0
10 0.0
11 0.0
12 16.0
13 15.0
14 14.0
15 20.0
16 19.0
17 18.0
Name: current, dtype: float64
答案 1 :(得分:3)
因此,您可以使用apply
和nested functions
import pandas as pd
ID = [2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2001980,2002222,2002222,2002222,2002222,2002222,2002222,2002222,2002222,]
Date = ["10/30/2017","10/29/2017","10/28/2017","10/27/2017","10/26/2017","10/25/2017","10/24/2017","10/23/2017","10/22/2017","10/21/2017","10/21/2017","10/20/2017","10/19/2017","10/18/2017","10/17/2017","10/16/2017","10/15/2017","10/14/2017",]
current = [1 ,0 ,0 ,40,39,0 ,0 ,60,0 ,0 ,0 ,0 ,16,0 ,0 ,20,19,18,]
df = pd.DataFrame({"ID": ID, "Date": Date, "current": current})
然后创建更新框架的功能
Python 3.X
def update_frame(df):
last_expected = None
def apply_logic(row):
nonlocal last_expected
last_row_id = row.name - 1
if row.name == 0:
last_expected = row["current"]
return last_expected
last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
last_expected = max(last_expected-1,row['current']) if last_row['ID'] == row['ID'] else row['current']
return last_expected
return apply_logic
Python 2.X
def update_frame(df):
sd = {"last_expected": None}
def apply_logic(row):
last_row_id = row.name - 1
if row.name == 0:
sd['last_expected'] = row["current"]
return sd['last_expected']
last_row = df.iloc[[last_row_id]].iloc[0].to_dict()
sd['last_expected'] = max(sd['last_expected'] - 1,row['current']) if last_row['ID'] == row['ID'] else row['current']
return sd['last_expected']
return apply_logic
运行如下功能
df['expected'] = df.apply(update_frame(df), axis=1)
输出符合预期
答案 2 :(得分:0)
编辑:解决OP关于扩展到数百万行的问题。
是的,我的原始答案不会扩展到非常大的数据帧。但是,通过少量编辑,这种易于阅读的解决方案将进行扩展。随后的优化利用了Numba中的JIT编译器。在导入Numba之后,我添加了jit装饰器并修改了函数以在numpy数组上操作而不是pandas对象。 Numba意识到numpy,无法优化pandas对象。
import numba
@numba.jit
def expected(id_col, current_col):
lexp = []
lstID = 0
expected = 0
for i in range(len(id_col)):
id, current = id_col[i], current_col[i]
if id == lstID:
expected = max(current, max(expected - 1, 0))
else:
expected = current
lexp.append(expected)
lstID = id
return np.array(lexp)
要将numpy数组传递给函数,请使用pandas系列的.values
属性。
df1['expected'] = expected(df1.ID.values, df1.current.values)
为了测试性能,我将原始数据帧扩展到超过100万行。
df1 = df
while len(df1) < 1000000:
df1 = pd.concat([df1, df1])
df1.reset_index(inplace=True, drop=True)
新的变化表现非常好。
%timeit expected(df1.ID.values, df1.current.values)
44.9 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df1.shape
Out[65]: (1179648, 4)
df1.tail(15)
Out[66]:
ID Date current expected
1179633 2001980 10/27/2017 40 40
1179634 2001980 10/26/2017 39 39
1179635 2001980 10/25/2017 0 38
1179636 2001980 10/24/2017 0 37
1179637 2001980 10/23/2017 60 60
1179638 2001980 10/22/2017 0 59
1179639 2001980 10/21/2017 0 58
1179640 2002222 10/21/2017 0 0
1179641 2002222 10/20/2017 0 0
1179642 2002222 10/19/2017 16 16
1179643 2002222 10/18/2017 0 15
1179644 2002222 10/17/2017 0 14
1179645 2002222 10/16/2017 20 20
1179646 2002222 10/15/2017 19 19
1179647 2002222 10/14/2017 18 18
原始答案
有点蛮力,但很容易理解。
def expected(df):
lexp = []
lstID = None
expected = 0
for i in range(len(df)):
id, current = df[['ID', 'current']].iloc[i]
if id == lstID:
expected = max(expected - 1, 0)
expected = max(current, expected)
else:
expected = current
lexp.append(expected)
lstID = id
return pd.Series(lexp)
输出
df['expected'] = expected(df)
df
Out[53]:
ID Date current expected
0 2001980 10/30/2017 1 1
1 2001980 10/29/2017 0 0
2 2001980 10/28/2017 0 0
3 2001980 10/27/2017 40 40
4 2001980 10/26/2017 39 39
5 2001980 10/25/2017 0 38
6 2001980 10/24/2017 0 37
7 2001980 10/23/2017 60 60
8 2001980 10/22/2017 0 59
9 2001980 10/21/2017 0 58
10 2002222 10/21/2017 0 0
11 2002222 10/20/2017 0 0
12 2002222 10/19/2017 16 16
13 2002222 10/18/2017 0 15
14 2002222 10/17/2017 0 14
15 2002222 10/16/2017 20 20
16 2002222 10/15/2017 19 19
17 2002222 10/14/2017 18 18
答案 3 :(得分:0)
我相信@Tarun Lalwani指出了一个正确的方向。那就是在DataFrame之外保存一些关键信息。虽然代码可以简化,但只要正确管理名称,使用全局变量没有任何问题。它是一种设计模式,通常可以使事情变得更简单并提高可读性。
cached_last = { 'expected': None, 'ID': None }
def set_expected(x):
if cached_last['ID'] is None or x.ID != cached_last['ID']:
expected = x.current
else:
expected = max(cached_last['expected'] - 1, x.current)
cached_last['ID'] = x.ID
cached_last['expected'] = expected
return expected
df['expected'] = df.apply(set_expected, axis=1)
从pandas.DataFrame.apply的文档中,请注意apply
函数的潜在副作用。
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
答案 4 :(得分:0)
这里的逻辑应该是工作
lst=[]
for _, y in df.groupby('ID'):
z=[]
for i,(_, x) in enumerate(y.iterrows()):
print(x)
if x['current'] > 0:
z.append(x['current'])
else:
try:
z.append(max(z[i-1]-1,0))
except:
z.append(0)
lst.extend(z)
lst
Out[484]: [1, 0, 0, 40, 39, 38, 37, 60, 59, 58, 0, 0, 16, 15, 14, 20, 19, 18]