Question

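I want to do some calculations across multiple columns of a "group by" result in a pandas DataFrame.

My table (the real one can have about 30k rows; all rows have the same date but different times):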

 id1       date_time               adress       a_size       
 reom      2005-8-20 22:51:10      75157.5413   ceifwekd
 reom      2005-8-20  1:01:25      3571.37946   ceifwekd
 reom      2005-8-20 11:21:01      3571.37946   tnohcve
 reom      2005-8-20  8:29:09      97439.219    tnohcve
 penr      2005-8-20 17:07:16      97439.219    ceifwekd
 penr      2005-8-20  9:10:37      7391.6258    ceifwekd
 ....

In SQL (SQL Server), the calculation can be expressed as:

   select id1,
      count(distinct date_time) * 1.0 / count(distinct [adress]) as nums_per_dist_adress,
      count(distinct date_time) * 1.0 / count(distinct(DATEPART(hh, date_time))) as ave_nums_per_hour,
      count(distinct(DATEPART(hh, date_time))) as dist_hour,
      count(distinct [adress]) * 1.0 / count(distinct(DATEPART(hh, date_time))) as ave_dist_ip_per_hour,
      count(distinct date_time) as dist_nums,
      count(distinct date_time) * 1.0 / (1 + count(distinct a_size)) as dist_num_per_a_size
   FROM my_table
   group by id1

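I need to do the same calculation in pandas. I could build a separate DataFrame for each combination of "id1" with another column and then join them one by one, but I would like to do it in a single pandas query.

My current approach: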

 nums_by_id1_df = self.__df.groupby('id1').size().to_frame('nums').reset_index()
 a_size_by_adress_id1_df = self.__df.groupby(['id1', 'adress'])['a_size'].nunique().to_frame('a_size_by_address').reset_index()

 new_df = pd.merge(nums_by_id1_df , a_size_by_adress_id1_df, on = 'id1', how = 'inner')
 new_df['new_col'] = new_df['nums'] / new_df['a_size_by_address']

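This approach is inefficient, and it may also break when the intermediate DataFrames have different numbers of rows, because each one is aggregated by a different "group by" key. For example, grouping by hour produces fewer rows than grouping by minute.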
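
For reference, here is a minimal sketch of what a single-groupby version of the SQL might look like. It assumes pandas 0.25 or later (for named aggregation) and that `date_time` is already parsed as a datetime column. The function name `summarize`, the helper column `hour`, and the intermediate names `dist_adress` and `dist_a_size` are made up for illustration. The sample rows are the ones from the table above.

```python
import pandas as pd

# Sample rows copied from the table above; the real data may have ~30k rows.
df = pd.DataFrame({
    "id1": ["reom", "reom", "reom", "reom", "penr", "penr"],
    "date_time": pd.to_datetime([
        "2005-08-20 22:51:10", "2005-08-20 01:01:25", "2005-08-20 11:21:01",
        "2005-08-20 08:29:09", "2005-08-20 17:07:16", "2005-08-20 09:10:37",
    ]),
    "adress": [75157.5413, 3571.37946, 3571.37946,
               97439.219, 97439.219, 7391.6258],
    "a_size": ["ceifwekd", "ceifwekd", "tnohcve",
               "tnohcve", "ceifwekd", "ceifwekd"],
})


def summarize(frame):
    # Hour of day, intended as the counterpart of DATEPART(hh, date_time).
    with_hour = frame.assign(hour=frame["date_time"].dt.hour)

    # One groupby on id1; each named aggregation is a COUNT(DISTINCT ...).
    counts = with_hour.groupby("id1").agg(
        dist_nums=("date_time", "nunique"),
        dist_adress=("adress", "nunique"),
        dist_hour=("hour", "nunique"),
        dist_a_size=("a_size", "nunique"),
    )

    # The ratios are column arithmetic on the one-row-per-id1 result,
    # so no merge between differently grouped frames is needed.
    result = counts.assign(
        nums_per_dist_adress=counts["dist_nums"] / counts["dist_adress"],
        ave_nums_per_hour=counts["dist_nums"] / counts["dist_hour"],
        ave_dist_ip_per_hour=counts["dist_adress"] / counts["dist_hour"],
        dist_num_per_a_size=counts["dist_nums"] / (1 + counts["dist_a_size"]),
    )
    # Drop the helper counts that the SQL does not return.
    return result.drop(columns=["dist_adress", "dist_a_size"]).reset_index()


result = summarize(df)
print(result)
```

Because every distinct count is taken at the `id1` level, the result always has one row per `id1`, which seems to avoid the row-count mismatch described above.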

Any suggestions? Thanks.

Do some calculations across multiple columns of a "group by" result in a pandas DataFrame, in a single query

0 answers: