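As requested, I am providing a reproducible example.
There are links to 1/6 of my dataframe (a Pickle object serialized through pandas.DataFrame) and to a Jupyter notebook with reproducible code. The notebook contains a sample dataframe on which the function is applied correctly and a larger dataframe on which it is not.
Note that Dropbox will say the preview is unavailable, but the files are still available. Please tell me if they are not.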
pool.map()
Related to this problem, I used this approach on a sample of my dataframe to check whether it is correct. This is what I ran:
m = dfsample.Result.eq('Win')
s = m.shift().cumsum()
dfsample['gap_in_days'] = dfsample.groupby(['name', s])['Gap done'].cumsum() #"Expected Gap" in the linked topic
dfsample['nb_of_games'] = dfsample.assign(nb_of_games = 1).groupby('name')['nb_of_games'].apply(lambda x:x.shift().cumsum()).fillna(0)
dfsample['gap_in_numbers'] = dfsample.assign(nb = 1).groupby(['name',s])['nb'].cumsum()
It produces what I expect:
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| Player | Result | Date | Gap done | gap_in_days | nb_of_games | gap_in_numbers |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| K2000 | Lose | 2015-11-13 13:42:00 | 0.0 | 0.0 | 0 | -1 * |
| K2000 | Lose | 2016-03-23 16:40:00 | 131.0 | 131.0 | 1 | 1 |
| K2000 | Lose | 2016-05-16 19:17:00 | 54.0 | 185.0 | 2 | 2 |
| K2000 | Win | 2016-06-09 19:36:00 | 54.0 | 239.0 | 3 | 3 |
| K2000 | Win | 2016-06-30 14:05:00 | 54.0 | 54.0 | 4 | 1 |
| K2000 | Lose | 2016-07-29 16:20:00 | 29.0 | 29.0 | 5 | 2 |
| K2000 | Win | 2016-10-08 17:48:00 | 29.0 | 58.0 | 6 | 3 |
| Kssis | Lose | 2007-02-25 15:05:00 | 0.0 | 0.0 | 0 | 1 * |
| Kssis | Lose | 2007-04-25 6:07:00 | 59.0 | 59.0 | 1 | 1 |
| Kssis | Not-ranked | 2007-06-01 16:54:00 | 37.0 | 96.0 | 2 | 2 |
| Kssis | Lose | 2007-09-09 14:33:00 | 99.0 | 195.0 | 3 | 3 |
| Kssis | Lose | 2008-04-06 16:27:00 | 210.0 | 405.0 | 4 | 4 |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
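To explain the data: Gap done is the number of days between two consecutive games. gap_in_days is the number of days until the player wins a game. I assume nb_of_games is self-explanatory. gap_in_numbers is the number of games played until the player wins.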
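For illustration only: the nb_of_games column in the table is a per-player running index that starts at 0, and groupby(...).cumcount() yields the same kind of numbering. This is a minimal sketch, not the code used above. The small frame `demo` is made up, and it groups by the `Player` column shown in the table rather than `name`.

```python
import pandas as pd

# Hypothetical miniature of the table above: three games for one player, two for another.
demo = pd.DataFrame({
    "Player": ["K2000", "K2000", "K2000", "Kssis", "Kssis"],
    "Result": ["Lose", "Lose", "Win", "Lose", "Lose"],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their existing order,
# i.e. how many games the player had already played before the current one.
demo["nb_of_games"] = demo.groupby("Player").cumcount()

print(demo)
```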
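Note on the values marked with *: I know these results look odd, but as I told Andy L., this can be corrected. I simply replace the value with 0 wherever nb_of_games is 0. I am showing it anyway because if you run the code you will obviously see it and ask about it.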
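The starred first rows appear to come from shift(): it leaves the first entry of the mask missing, so that row has no group key. One possible way around it, sketched here under assumptions, is to fill that entry with 0 before taking the cumulative sum. The frame `demo` repeats the K2000 rows of the table, and grouping by `Player` instead of `name` is an assumption.

```python
import pandas as pd

# The K2000 rows from the table above (Result and "Gap done" only).
demo = pd.DataFrame({
    "Player": ["K2000"] * 7,
    "Result": ["Lose", "Lose", "Lose", "Win", "Win", "Lose", "Win"],
    "Gap done": [0.0, 131.0, 54.0, 54.0, 54.0, 29.0, 29.0],
})

won = demo["Result"].eq("Win").astype(int)

# shift() moves each win down one row and leaves the first entry missing (NaN);
# fillna(0) gives that first row a group key before the running sum.
streak_id = won.shift().fillna(0).cumsum()

# Running total of "Gap done" that restarts on the row after each win.
demo["gap_in_days"] = demo.groupby(["Player", streak_id])["Gap done"].cumsum()

print(demo)
```

For these rows the filled key reproduces the gap_in_days values shown in the table.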
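Now, when I apply the same logic inside the function passed to pool.map(function, iterable), it does not work, whereas applying the same function to the sample dataframe dfsample works perfectly.
The function is as follows: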
def gap_nb(df):
    s = mask_result(df)
    df['gap_in_numbers'] = df.assign(nb = 1).groupby(['name',s])['nb'].cumsum()
    return df
The function mask_result is:
def mask_result(df):
    mask = df.Result.eq('P')
    s = mask.shift().cumsum()
    return s
It is then used with pool.map(function, iterable) as
dfs = pool.map(gap_nb , dfs) #where dfs is a list of slices of a big dataframe
It simply fills the gap_in_numbers column with 1, like this:
+----------------+
| gap_in_numbers |
+----------------+
| 0 |
| 1 |
| 1 |
| 1 |
| 1 |
| ... |
| 1 |
+----------------+
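For reference, here is a self-contained sketch of the pool.map(function, iterable) pattern described above. It is not the code from the notebook: `make_demo`, `split_by_player` and `run_parallel` are hypothetical helpers, the grouping column is `Player` as in the table, the mask looks for 'Win', and the first entry of the shifted mask is filled with 0 so every row has a group key. Each worker only sees the slice it receives, so the sketch builds one slice per player to keep a player's games together.

```python
import multiprocessing as mp

import pandas as pd


def make_demo():
    # Hypothetical data: the Player and Result columns of the table above.
    return pd.DataFrame({
        "Player": ["K2000"] * 7 + ["Kssis"] * 5,
        "Result": ["Lose", "Lose", "Lose", "Win", "Win", "Lose", "Win",
                   "Lose", "Lose", "Not-ranked", "Lose", "Lose"],
    })


def mask_result(df, win_label="Win"):
    # Group id that increases by one on the row after each win.
    won = df["Result"].eq(win_label).astype(int)
    return won.shift().fillna(0).cumsum()


def gap_nb(df):
    # Work on a copy so the slice handed to the worker is not modified in place.
    df = df.copy()
    s = mask_result(df)
    df["gap_in_numbers"] = df.assign(nb=1).groupby(["Player", s])["nb"].cumsum()
    return df


def split_by_player(df):
    # One slice per player, so no player's games are spread over two workers.
    return [group for _, group in df.groupby("Player", sort=False)]


def run_parallel(df, processes=2):
    # Each slice is pickled, sent to a worker, processed by gap_nb and sent back.
    with mp.Pool(processes) as pool:
        parts = pool.map(gap_nb, split_by_player(df))
    return pd.concat(parts).sort_index()


if __name__ == "__main__":
    print(run_parallel(make_demo()))
```

Because of the fillna(0), the counter here starts at 1 in each player's first row, which differs from the starred values in the table above.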
I tried some workarounds, such as calling assign() in one function and then applying cumsum() in another, but the result is the same.
So, can anyone tell me why?
Pandas version: 0.23.4, Python version: 3.7.4
Sample data (without the last column)
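import io
import pandas as pd
# Each row has separate date and time fields, so the header names them Date and Time.
s = '''Player,Result,Date,Time,Gap done,gap_in_days,nb_of_games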
K2000,Lose,2015-11-13,13:42:00,0.0,0.0,0
K2000,Lose,2016-03-23,16:40:00,131.0,131.0,1
K2000,Lose,2016-05-16,19:17:00,54.0,185.0,2
K2000,Win,2016-06-09,19:36:00,54.0,239.0,3
K2000,Win,2016-06-30,14:05:00,54.0,54.0,4
K2000,Lose,2016-07-29,16:20:00,29.0,29.0,5
K2000,Win,2016-10-08,17:48:00,29.0,58.0,6
Kssis,Lose,2007-02-25,15:05:00,0.0,0.0,0
Kssis,Lose,2007-04-25,6:07:00,59.0,59.0,1
Kssis,Not-ranked,2007-06-01,16:54:00,37.0,96.0,2
Kssis,Lose,2007-09-09,14:33:00,99.0,195.0,3
Kssis,Lose,2008-04-06,16:27:00,210.0,405.0,4'''
df = pd.read_csv(io.StringIO(s))