F_Date B_Date col is_B
01/09/2019 02/08/2019 2200 1
01/09/2019 03/08/2019 672 1
02/09/2019 03/08/2019 1828 1
01/09/2019 04/08/2019 503 0
02/09/2019 04/08/2019 829 1
03/09/2019 04/08/2019 1367 0
02/09/2019 05/08/2019 559 1
03/09/2019 05/08/2019 922 1
04/09/2019 05/08/2019 1519 0
01/09/2019 06/08/2019 376 1
I want to generate a column c_a such that, for the first entry of each F_Date, the initial value is 25000, which then decreases according to the col values. For example:
Expected output:
F_Date B_Date col is_B c_a
01/09/2019 02/08/2019 2200 1 25000
01/09/2019 03/08/2019 672 1 25000 - 2200
02/09/2019 03/08/2019 1828 1 25000
01/09/2019 04/08/2019 503 0 25000 - 2200 - 672
02/09/2019 04/08/2019 829 1 25000 - 1828
03/09/2019 04/08/2019 1367 0 25000
02/09/2019 05/08/2019 559 1 25000 - 1828 - 829
03/09/2019 05/08/2019 922 1 25000 (since last value had is_B as 0)
04/09/2019 05/08/2019 1519 0 25000
01/09/2019 06/08/2019 376 1 25000 - 2200 - 672 (Since last appearance had is_B as 0)
Can anyone identify a pandas approach to achieve this?
Answer 0 (score: 3)
I think I found a pretty concise solution:
df['c_a'] = df.groupby('F_Date').apply(
    lambda grp: 25000 - grp.col.where(grp.is_B.eq(1), 0)
                             .shift(fill_value=0).cumsum()
).reset_index(level=0, drop=True)
The result is:
F_Date B_Date col is_B c_a
0 01/09/2019 02/08/2019 2200 1 25000
1 01/09/2019 03/08/2019 672 1 22800
2 02/09/2019 03/08/2019 1828 1 25000
3 01/09/2019 04/08/2019 503 0 22128
4 02/09/2019 04/08/2019 829 1 23172
5 03/09/2019 04/08/2019 1367 0 25000
6 02/09/2019 05/08/2019 559 1 22343
7 03/09/2019 05/08/2019 922 1 25000
8 04/09/2019 05/08/2019 1519 0 25000
9 01/09/2019 06/08/2019 376 1 22128
The idea, illustrated on the group F_Date == '01/09/2019':
- the values to be subtracted from the following rows in the group:
grp.col.where(grp.is_B.eq(1), 0)
0 2200
1 672
3 0
9 376
- the values subtracted from the current row in the group:
.shift(fill_value=0)
0 0
1 2200
3 672
9 0
- the cumulative values subtracted:
.cumsum()
0 0
1 2200
3 2872
9 2872
- the target values:
25000 - ...
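For completeness, a minimal runnable sketch of this answer (building the DataFrame from the question's sample data is my own addition):
import pandas as pd
df = pd.DataFrame({
    'F_Date': ['01/09/2019', '01/09/2019', '02/09/2019', '01/09/2019', '02/09/2019',
               '03/09/2019', '02/09/2019', '03/09/2019', '04/09/2019', '01/09/2019'],
    'B_Date': ['02/08/2019', '03/08/2019', '03/08/2019', '04/08/2019', '04/08/2019',
               '04/08/2019', '05/08/2019', '05/08/2019', '05/08/2019', '06/08/2019'],
    'col':  [2200, 672, 1828, 503, 829, 1367, 559, 922, 1519, 376],
    'is_B': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
})
# zero out col where is_B == 0, shift down one row within each F_Date group,
# then subtract the running total from the 25000 starting value
df['c_a'] = df.groupby('F_Date').apply(
    lambda grp: 25000 - grp.col.where(grp.is_B.eq(1), 0)
                             .shift(fill_value=0).cumsum()
).reset_index(level=0, drop=True)
print(df)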
Answer 1 (score: 1)
A nice pandas exercise :)
import pandas as pd

df = pd.DataFrame({
    'F_Date': [pd.to_datetime(_, format='%d/%m/%Y') for _ in
               ['01/09/2019', '01/09/2019', '02/09/2019', '01/09/2019', '02/09/2019',
                '03/09/2019', '02/09/2019', '03/09/2019', '04/09/2019', '01/09/2019']],
    'B_Date': [pd.to_datetime(_, format='%d/%m/%Y') for _ in
               ['02/08/2019', '03/08/2019', '03/08/2019', '04/08/2019', '04/08/2019',
                '04/08/2019', '05/08/2019', '05/08/2019', '05/08/2019', '06/08/2019']],
    'col': [2200, 672, 1828, 503, 829, 1367, 559, 922, 1519, 376],
    'is_B': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
})
Let's go through it step by step:
# sort in the order that fits the semantics of the calculation
df.sort_values(['F_Date', 'B_Date'], inplace=True)

# initialize 'c_a' to 25000 wherever a new F_Date starts
df.loc[df['F_Date'].diff(1) != pd.Timedelta(0), 'c_a'] = 25000

# step down from every 25000: subtract the shifted 'col'
# if the shifted 'is_B' == 1, otherwise carry the shifted 'c_a' forward
while pd.isna(df.c_a).any():
    df.c_a.where(
        pd.notna(df.c_a),                       # keep values that are already set ...
        df.c_a.shift(1).where(                  # ... and fill NaNs from the previous c_a ...
            df.is_B.shift(1) == 0,              # ... unchanged if the previous is_B == 0,
            df.c_a.shift(1) - df.col.shift(1)   # ... otherwise minus the previous 'col'
        ), inplace=True
    )

# restore the original order
df.sort_index(inplace=True)
This is the result I get:
F_Date B_Date col is_B c_a
0 2019-09-01 2019-08-02 2200 1 25000.0
1 2019-09-01 2019-08-03 672 1 22800.0
2 2019-09-02 2019-08-03 1828 1 25000.0
3 2019-09-01 2019-08-04 503 0 22128.0
4 2019-09-02 2019-08-04 829 1 23172.0
5 2019-09-03 2019-08-04 1367 0 25000.0
6 2019-09-02 2019-08-05 559 1 22343.0
7 2019-09-03 2019-08-05 922 1 25000.0
8 2019-09-04 2019-08-05 1519 0 25000.0
9 2019-09-01 2019-08-06 376 1 22128.0
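As a quick sanity check (my own addition, not part of the original answer), the result can be compared with the groupby one-liner from the first answer; on this data both agree, since within each F_Date the rows are already ordered by B_Date:
expected = df.groupby('F_Date').apply(
    lambda grp: 25000 - grp.col.where(grp.is_B.eq(1), 0)
                             .shift(fill_value=0).cumsum()
).reset_index(level=0, drop=True)
# align expected back to the original row order before comparing
assert (df['c_a'] == expected.sort_index()).all()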
Answer 2 (score: 0)
Try groupby with shift, cumsum and ffill:
# mask that is False where is_B flips from 0 to 1 within an F_Date group
m = ~df.groupby('F_Date').is_B.diff().eq(1)
# running balance per F_Date: start at 25000 and subtract each previous col
s = (-df.col).groupby(df.F_Date).apply(lambda x: x.shift(fill_value=25000).cumsum())
# keep the balance where the mask holds; forward-fill over the reset rows
df['c_a'] = s.where(m).groupby(df.F_Date).ffill()
Out[98]:
F_Date B_Date col is_B c_a
0 01/09/2019 02/08/2019 2200 1 25000.0
1 01/09/2019 03/08/2019 672 1 22800.0
2 02/09/2019 03/08/2019 1828 1 25000.0
3 01/09/2019 04/08/2019 503 0 22128.0
4 02/09/2019 04/08/2019 829 1 23172.0
5 03/09/2019 04/08/2019 1367 0 25000.0
6 02/09/2019 05/08/2019 559 1 22343.0
7 03/09/2019 05/08/2019 922 1 25000.0
8 04/09/2019 05/08/2019 1519 0 25000.0
9 01/09/2019 06/08/2019 376 1 22128.0
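To see why this works, it helps to inspect the intermediates for one group (a sketch of my own, not part of the original answer):
# the F_Date == '01/09/2019' group occupies rows 0, 1, 3 and 9; its is_B values are [1, 1, 0, 1]
print(m.loc[[0, 1, 3, 9]].tolist())  # [True, True, True, False] -- False where is_B flips 0 -> 1
print(s.loc[[0, 1, 3, 9]].tolist())  # [25000, 22800, 22128, 21625] -- keeps subtracting past the reset
# s.where(m) turns row 9 into NaN, and the groupwise ffill restores 22128 from row 3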
Answer 3 (score: 0)
The answer comes in two parts. The first thing to do is to group the dataframe by F_Date. Once we have the groups, we can use expanding() to operate on all preceding values up to the current row. There is a catch here: with expanding().apply you can only return a single real value. We can work around that by passing both the group dataframe and the original dataframe into the function we call from apply, and setting the values there. This solution may not be perfect or the most performant, it just works.
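A tiny illustration of that scalar-return constraint first (my own sketch, not part of the original answer):
import pandas as pd
demo = pd.Series([10, 20, 30])
# apply sees the growing windows [10], [10, 20], [10, 20, 30]
# and must reduce each one to a single float
print(demo.expanding().apply(lambda x: x.sum(), raw=True))  # 10.0, 30.0, 60.0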
In [1]: s = '''F_Date B_Date col is_B
...: 01/09/2019 02/08/2019 2200 1
...: 01/09/2019 03/08/2019 672 1
...: 02/09/2019 03/08/2019 1828 1
...: 01/09/2019 04/08/2019 503 0
...: 02/09/2019 04/08/2019 829 1
...: 03/09/2019 04/08/2019 1367 0
...: 02/09/2019 05/08/2019 559 1
...: 03/09/2019 05/08/2019 922 1
...: 04/09/2019 05/08/2019 1519 0
...: 01/09/2019 06/08/2019 376 1'''
In [2]: import re
In [3]: sl = [re.split(r'\s+', x) for x in s.split('\n')]
In [4]: import pandas as pd
In [5]: df = pd.DataFrame(sl[1:], columns=sl[0])
In [6]: df['F_Date'] = df['F_Date'].astype('datetime64[ns]')
In [7]: df['B_Date'] = df['B_Date'].astype('datetime64[ns]')
In [8]: df['col'] = df['col'].astype(int)
In [9]: df['is_B'] = df['is_B'].astype(int)
In [10]: df['c_a'] = None
In [11]: def l(df, df_g, cols):
    ...:     # is_B flags of the rows seen so far in this group
    ...:     is_Bs = df_g['is_B'].values[:len(cols)]
    ...:     # the 25000 start value plus every preceding col whose is_B == 1
    ...:     values = [25000] + [cols[i] for i in range(len(cols) - 1) if is_Bs[i]]
    ...:     # store the list of components on the current row of the original df
    ...:     df.at[df_g.index[len(cols) - 1], 'c_a'] = values
    ...:     return 1
In [12]: for dt, df_g in df.groupby('F_Date'):
    ...:     df_g['col'].expanding().apply(lambda x: l(df, df_g, x), raw=True)
    ...:
In [13]: df
Out[13]:
F_Date B_Date col is_B c_a
0 2019-01-09 2019-02-08 2200 1 [25000]
1 2019-01-09 2019-03-08 672 1 [25000, 2200.0]
2 2019-02-09 2019-03-08 1828 1 [25000]
3 2019-01-09 2019-04-08 503 0 [25000, 2200.0, 672.0]
4 2019-02-09 2019-04-08 829 1 [25000, 1828.0]
5 2019-03-09 2019-04-08 1367 0 [25000]
6 2019-02-09 2019-05-08 559 1 [25000, 1828.0, 829.0]
7 2019-03-09 2019-05-08 922 1 [25000]
8 2019-04-09 2019-05-08 1519 0 [25000]
9 2019-01-09 2019-06-08 376 1 [25000, 2200.0, 672.0]
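To turn these component lists into the final c_a numbers from the question, one extra step works (my own follow-up, not part of the original answer): subtract the tail of each list from its head.
In [14]: df['c_a'] = df['c_a'].apply(lambda v: v[0] - sum(v[1:]))
In [15]: df['c_a'].tolist()
Out[15]: [25000.0, 22800.0, 25000.0, 22128.0, 23172.0, 25000.0, 22343.0, 25000.0, 25000.0, 22128.0]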