Curious: function not applying to a bigger pandas.DataFrame

Asked: 2019-11-24 14:03:37

Tags: python python-3.x pandas multiprocessing

Updated question

As requested, I am providing a reproducible example.
There is a link to 1/6 of my dataframe (a pandas.DataFrame serialized as a pickle) and to a Jupyter notebook with reproducible code, containing a sample dataframe on which the function applies correctly and the larger dataframe on which it does not.

Note that Dropbox may say the preview is not available, but the file can still be downloaded; please tell me if it is not accessible.

Original problem (which, in the end, does not come from pool.map())

Linked to this topic, I used this method on a sample of my dataframe to check that it works correctly, as follows:

m = dfsample.Result.eq('Win')
s = m.shift().cumsum()   # grouping key: a new segment starts right after each 'Win'
dfsample['gap_in_days'] = dfsample.groupby(['name', s])['Gap done'].cumsum()   # "Expected Gap" in the linked topic
dfsample['nb_of_games'] = dfsample.assign(nb_of_games=1).groupby('name')['nb_of_games'].apply(lambda x: x.shift().cumsum()).fillna(0)
dfsample['gap_in_numbers'] = dfsample.assign(nb=1).groupby(['name', s])['nb'].cumsum()

It renders what I expect:

+-----------+------------+---------------------+----------+-------------+-------------+----------------+
|    Player |   Result   |        Date         | Gap done | gap_in_days | nb_of_games | gap_in_numbers |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+
| K2000     | Lose       | 2015-11-13 13:42:00 |      0.0 |         0.0 |           0 | -1 *           |
| K2000     | Lose       | 2016-03-23 16:40:00 |    131.0 |       131.0 |           1 | 1              |
| K2000     | Lose       | 2016-05-16 19:17:00 |     54.0 |       185.0 |           2 | 2              |
| K2000     | Win        | 2016-06-09 19:36:00 |     54.0 |       239.0 |           3 | 3              |
| K2000     | Win        | 2016-06-30 14:05:00 |     54.0 |        54.0 |           4 | 1              |
| K2000     | Lose       | 2016-07-29 16:20:00 |     29.0 |        29.0 |           5 | 2              |
| K2000     | Win        | 2016-10-08 17:48:00 |     29.0 |        58.0 |           6 | 3              |
| Kssis     | Lose       | 2007-02-25 15:05:00 |      0.0 |         0.0 |           0 | 1 *            |
| Kssis     | Lose       | 2007-04-25 6:07:00  |     59.0 |        59.0 |           1 | 1              |
| Kssis     | Not-ranked | 2007-06-01 16:54:00 |     37.0 |        96.0 |           2 | 2              |
| Kssis     | Lose       | 2007-09-09 14:33:00 |     99.0 |       195.0 |           3 | 3              |
| Kssis     | Lose       | 2008-04-06 16:27:00 |    210.0 |       405.0 |           4 | 4              |
+-----------+------------+---------------------+----------+-------------+-------------+----------------+

To explain the data: Gap done is the number of days between two different games. gap_in_days is the number of days until the player wins a game. I guess nb_of_games is self-explanatory. gap_in_numbers is the number of games played until the player wins.
Note about the values marked with *: I know these results look odd, but as I told Andy L., this can be corrected; I simply replace the value with 0 wherever nb_of_games is 0. I show them anyway, because if you test the code you will obviously see them and ask about them.
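To make the grouping trick above clearer, here is a minimal sketch on a made-up two-player frame (illustrative data, not my real one): m.shift().cumsum() builds a key that starts a new segment right after every 'Win', so the grouped cumsum restarts there.

import pandas as pd

demo = pd.DataFrame({
    'name':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Result': ['Lose', 'Win', 'Lose', 'Win', 'Lose', 'Win'],
})
m = demo.Result.eq('Win')
s = m.shift().cumsum()   # NaN, 0, 1, 1, 2, 2 -> a new segment id right after each 'Win'
demo['gap_in_numbers'] = demo.assign(nb=1).groupby(['name', s])['nb'].cumsum()
# the very first row has a NaN key, which is presumably where the odd '*' values come from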

Now, when I apply the same thing inside the function passed to pool.map(function, iterable), it does not work, whereas applying that same function to the sample dataframe dfsample works perfectly fine.

The function is:

def gap_nb(df):
    s = mask_result(df)   # grouping key built by mask_result below
    df['gap_in_numbers'] = df.assign(nb=1).groupby(['name', s])['nb'].cumsum()
    return df

The function mask_result is:

def mask_result(df):
    mask = df.Result.eq('P')
    s = mask.shift().cumsum()
    return s

Then, using pool.map(function, iterable) as:

dfs = pool.map(gap_nb, dfs)   # where dfs is a list of slices of a big dataframe

It just renders the gap_in_numbers column as (essentially all 1s):

+----------------+
| gap_in_numbers |
+----------------+
|              0 |
|              1 |
|              1 |
|              1 |
|              1 |
|            ... |
|              1 |
+----------------+
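For completeness, the surrounding multiprocessing setup is roughly the sketch below; the splitting code is not shown above, so np.array_split, the pool size and the pickle path are illustrative assumptions, not necessarily the exact code.

import multiprocessing as mp
import numpy as np
import pandas as pd

if __name__ == '__main__':
    big_df = pd.read_pickle('big_dataframe.pkl')   # hypothetical path to the big dataframe
    dfs = np.array_split(big_df, mp.cpu_count())   # list of slices, one per worker (assumed)
    with mp.Pool(mp.cpu_count()) as pool:
        dfs = pool.map(gap_nb, dfs)                # apply gap_nb to every slice
    result = pd.concat(dfs)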

I tried some workarounds, for example using assign() in one function and then applying cumsum() in another, but it returns the same result.
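That attempt looked roughly like the sketch below (reconstructed for illustration; the helper names are made up):

def add_nb(df):
    # first function: only add the helper column of 1s
    return df.assign(nb=1)

def gap_nb_split(df):
    # second function: apply the grouped cumsum on the helper column
    s = mask_result(df)
    df = add_nb(df)
    df['gap_in_numbers'] = df.groupby(['name', s])['nb'].cumsum()
    return df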

So, can anyone tell me why?


Pandas version: 0.23.4, Python version: 3.7.4


Sample data (without the last column)

import io
import pandas as pd

s = '''Player,Result,Date,Gap done,gap_in_days,nb_of_games
K2000,Lose,2015-11-13 13:42:00,0.0,0.0,0
K2000,Lose,2016-03-23 16:40:00,131.0,131.0,1
K2000,Lose,2016-05-16 19:17:00,54.0,185.0,2
K2000,Win,2016-06-09 19:36:00,54.0,239.0,3
K2000,Win,2016-06-30 14:05:00,54.0,54.0,4
K2000,Lose,2016-07-29 16:20:00,29.0,29.0,5
K2000,Win,2016-10-08 17:48:00,29.0,58.0,6
Kssis,Lose,2007-02-25 15:05:00,0.0,0.0,0
Kssis,Lose,2007-04-25 6:07:00,59.0,59.0,1
Kssis,Not-ranked,2007-06-01 16:54:00,37.0,96.0,2
Kssis,Lose,2007-09-09 14:33:00,99.0,195.0,3
Kssis,Lose,2008-04-06 16:27:00,210.0,405.0,4'''

df = pd.read_csv(io.StringIO(s), parse_dates=['Date'])
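To run the snippet from the top of the question on this sample, note that the column is named Player here where my functions use name; adapted, it would look like this:

m = df.Result.eq('Win')
s = m.shift().cumsum()
df['gap_in_numbers'] = df.assign(nb=1).groupby(['Player', s])['nb'].cumsum()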

0 answers:

There are no answers yet.