Split a pandas groupby() into smaller groups and combine them

Date: 2018-08-07 09:56:27

Tags: python python-2.7 pandas grouping pandas-groupby

                            city  temperature  windspeed   event
            day
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain
            2017-01-08  new york           23          7    Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85          9   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53          8   Sunny

The original data is shown above.

Below is the result of calling np.array_split(data, 4).

            day             city  temperature  windspeed   event
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain

            day             city  temperature  windspeed   event
            2017-01-08  new york           23          7    Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny

            day             city  temperature  windspeed   event
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85          9   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy

            day             city  temperature  windspeed   event
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53          8   Sunny

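As you can see, I am trying to create 4 groups from the original data such that every group contains all of the cities. np.array_split() does split the data into 4 groups, but the groups do not each contain every city. I want every group to include mumbai, paris and new york. How can I do this?

To be clear, this is what I want to achieve: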

Group 1:

            day             city  temperature  windspeed   event
            2017-01-01  new york           32          6    Rain
            2017-01-02     paris           50         13  Cloudy
            2017-01-02    mumbai           85         12     Fog
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-05    mumbai           89          7   Sunny
            2017-01-05     paris           43         20   Sunny

Group 2:

            day             city  temperature  windspeed   event
            2017-01-04  new york           33          7   Sunny
            2017-01-01    mumbai           90          5   Sunny
            2017-01-03     paris           54          8  Cloudy
            2017-01-07  new york           27         12    Rain
            2017-01-06    mumbai           80         10     Fog
            2017-01-09     paris           53          8   Sunny

Group 3:

            day             city  temperature  windspeed   event
            2017-01-02  new york           36          7   Sunny
            2017-01-03    mumbai           87         15     Fog
            2017-01-01     paris           45         20   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain

Group 4:

            day             city  temperature  windspeed   event
            2017-01-03  new york           28         12    Snow
            2017-01-04    mumbai           92          5    Rain
            2017-01-07    mumbai           85          9   Sunny
            2017-01-04     paris           42         10  Cloudy
            2017-01-08     paris           42         15  Cloudy
            2017-01-08  new york           23          7    Rain

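As the expected result shows, the main point is that every group contains every city.

What I have in mind is to group the data by city, split each city's dataframe into 4 groups, and then combine the corresponding groups across cities to get the 4 final groups.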
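The plan described above could be sketched roughly as follows. This is only one possible reading of it: the helper name split_by_city, its parameters, and the small made-up dataframe are assumptions for illustration and are not part of the question. Each city's rows are cut into 4 positional chunks, and chunk i of every city is concatenated into final group i.

```python
import numpy as np
import pandas as pd


def split_by_city(df, n_groups=4, city_col="city"):
    """Split each city's rows into n_groups chunks, then merge chunk i of every city.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts = [[] for _ in range(n_groups)]
    for _, city_df in df.groupby(city_col, sort=False):
        # Split row positions rather than the frame itself; np.array_split
        # tolerates lengths that are not divisible by n_groups.
        positions = np.array_split(np.arange(len(city_df)), n_groups)
        for i, pos in enumerate(positions):
            parts[i].append(city_df.iloc[pos])
    return [pd.concat(p) for p in parts]


# Made-up data: 3 cities with 8 days each (not the data from the question)
days = pd.date_range("2017-01-01", periods=8)
df = pd.DataFrame(
    {
        "city": np.repeat(["new york", "mumbai", "paris"], 8),
        "temperature": np.arange(24),
    },
    index=pd.Index(list(days) * 3, name="day"),
)

groups = split_by_city(df)
for i, g in enumerate(groups, start=1):
    print("Group", i, sorted(g["city"].unique()), len(g))
```

Note that a city with fewer rows than n_groups would contribute empty chunks to the later groups, so those groups would not contain that city.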

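2 answers:

Answer 0 (score: 2)

You can create a helper column with GroupBy + cumcount that counts the occurrences of each city.

Then use dict + tuple with another GroupBy to create a dictionary of dataframes, each of which contains one occurrence of every city.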

# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()

# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))

Result:

print(d)

{0:                city  temperature  windspeed  event  index
 day                                                      
 2017-01-01  newyork           32          6   Rain      0
 2017-01-01   mumbai           90          5  Sunny      0
 2017-01-01    paris           45         20  Sunny      0,
 1:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-02  newyork           36          7   Sunny      1
 2017-01-02   mumbai           85         12     Fog      1
 2017-01-02    paris           50         13  Cloudy      1,
 2:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-03  newyork           28         12    Snow      2
 2017-01-03   mumbai           87         15     Fog      2
 2017-01-03    paris           54          8  Cloudy      2,
 3:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-04  newyork           33          7   Sunny      3
 2017-01-04   mumbai           92          5    Rain      3
 2017-01-04    paris           42         10  Cloudy      3}

You can then extract the individual "groups" via d[0], d[1], d[2], d[3]. In this particular case, you may prefer to key the groups by date, i.e.

d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}
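The helper column above yields one group per occurrence count, which is more than 4 groups when a city has more than 4 rows. If exactly 4 groups are wanted, one possible variation (an assumption on my part, not something the answer states) is to take the cumcount modulo the number of groups, so that each city's rows are dealt round-robin into 4 buckets. The sample data and the names n_groups and bucket below are made up for illustration.

```python
import pandas as pd

# Made-up sample: 2 cities with 5 days each (not the data from the question)
df = pd.DataFrame({
    "day": ["2017-01-01", "2017-01-02", "2017-01-03",
            "2017-01-04", "2017-01-05"] * 2,
    "city": ["new york"] * 5 + ["paris"] * 5,
    "temperature": [32, 36, 28, 33, 31, 45, 50, 54, 42, 43],
}).set_index("day")

n_groups = 4  # assumed number of final groups

# 0, 1, 2, ... within each city, wrapped round-robin into n_groups buckets
bucket = df.groupby("city").cumcount() % n_groups

# Pass plain values so no alignment on the duplicated 'day' index is involved
d = {key: part for key, part in df.groupby(bucket.to_numpy())}

for key, part in d.items():
    print(key, sorted(part["city"].unique()), len(part))
```

With this scheme the earlier buckets receive the leftover rows when a city's row count is not a multiple of n_groups, so group sizes can differ by one row per city.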

Answer 1 (score: 0)

Here is how I would approach it. First, sort the dataframe by day and city:

df = df.sort_values(by=['day', 'city'])

Next, find the boundaries that split your dataframe into 4 equal groups. If the rows do not divide evenly, the last group gets the remainder:

import numpy as np

n = len(df) // 4  # rows per group, rounded down
# cumulative start/end positions; the last group absorbs the remainder
groups_n = np.cumsum([0, n, n, n, len(df) - 3 * n])
print(groups_n)
OUT >> [ 0  6 12 18 25]

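groups_n holds the start and end position of each group. So for Group 1 I take df.iloc[0:6], and for Group 4 I take df.iloc[18:25].

The final dictionary d holding the 4-group split of the dataframe is therefore: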

d = {}
for i in range(4):
    d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]
  

Example output: Group 1 (d[1]):

            city      temperature  windspeed    event
day             
2017-01-01  mumbai    90           5            Sunny
2017-01-01  new york  32           6            Rain
2017-01-01  paris     45           20           Sunny
2017-01-02  mumbai    85           12           Fog
2017-01-02  new york  36           7            Sunny
2017-01-02  paris     50           13           Cloudy
  

Group 4 (d[4]):

            city       temperature  windspeed   event
day             
2017-01-07  mumbai     85           9           Sunny
2017-01-07  new york   27           12          Rain
2017-01-07  paris      40           14          Rain
2017-01-08  mumbai     89           8           Rain
2017-01-08  new york   23           7           Rain
2017-01-08  paris      42           15          Cloudy
2017-01-09  paris      53           8           Sunny
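
Positional slicing only satisfies the question's requirement when the sort order interleaves the cities, so it may be worth checking the result. The following is a minimal sketch of such a check; the helper name groups_cover_all_cities and the tiny dataframe are hypothetical and only serve as illustration.

```python
import pandas as pd


def groups_cover_all_cities(groups, all_cities, city_col="city"):
    """Return True if every dataframe in `groups` contains every city.

    Hypothetical helper: the name and signature are illustrative only.
    """
    wanted = set(all_cities)
    return all(wanted <= set(g[city_col]) for g in groups)


# Made-up data: 2 cities with 4 days each, stored city by city
days = pd.date_range("2017-01-01", periods=4)
df = pd.DataFrame({
    "day": list(days) * 2,
    "city": ["new york"] * 4 + ["mumbai"] * 4,
})
cities = df["city"].unique()

# Positional halves of the unsorted frame: each half holds a single city
unsorted_halves = [df.iloc[:4], df.iloc[4:]]

# After sorting by day the cities interleave, so each half holds both
by_day = df.sort_values(["day", "city"])
sorted_halves = [by_day.iloc[:4], by_day.iloc[4:]]

print(groups_cover_all_cities(unsorted_halves, cities))  # False
print(groups_cover_all_cities(sorted_halves, cities))    # True
```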