Grouping by zone and date using pandas

Time: 2016-06-15 13:15:30

Tags: python csv pandas dataframe grouping

temp.csv

Hello,

I have a csv file (see the sample input csv in the image). I need to get a dataframe with the total cost of the "Amazon Elastic Compute Cloud" services running in each Availability Zone, grouped by date.

Something like this:

| UsageStartDate | AvailabilityZone | Sum of products used             | Total cost for each |
| 6/1/16         | ap-northeast-1a  | Amazon Elastic Compute Cloud = 6 | 15$                 |
| 6/2/16         | ap-southeast-2   | Amazon Elastic Compute Cloud = 3 | 12$                 |

This is how I tried it with pandas:

funk = pd.read_csv('/tmp/temp.csv')
funk.sort_values('UsageStartDate') 
k = funk['AvailabilityZone'][funk['ProductName'] == 'Amazon Elastic Compute Cloud'].sum()
print  k 

Any help with this? I am learning pandas.

Here is the data:

    ProductName                    AvailabilityZone UsageStartDate BlendedCost
0   Amazon Simple Queue Service    NaN              6/1/16 0:00    0
1   Alexa Web Information Service  NaN              6/1/16 0:00    0.00347032
2   Amazon DynamoDB                ap-southeast-2   6/1/16 0:00    0
3   Amazon DynamoDB                ap-southeast-2   6/1/16 0:00    0
4   Amazon Elastic Compute Cloud   ap-northeast-1a  6/1/16 0:00    0.1
5   Amazon Elastic Compute Cloud   ap-northeast-1a  6/1/16 0:00    0.02
6   Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    0
7   Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    0
8   Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    4.70E-06
9   Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    8.00E-08
10  Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    0.00133333
11  Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    0.005
12  Amazon Elastic Compute Cloud   ap-southeast-1a  6/1/16 0:00    0.02
13  Amazon Elastic Compute Cloud   ap-southeast-1a  6/1/16 0:00    0.02
14  Amazon Elastic Compute Cloud   ap-southeast-1b  6/1/16 0:00    0.02
15  Amazon Elastic Compute Cloud   NaN              6/1/16 0:00    0

2 answers:

Answer 0 (score: 2)

I think you need groupby with an aggregation: len of AvailabilityZone and sum of BlendedCost:

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,
               'BlendedCost':sum}))

Sample:

import pandas as pd

raw_data = {
    'ProductName': ['ASQS', 'AWIS', 'AWIS', 'AECC', 'AECC'], 
    'UsageStartDate': ['6/1/16','6/1/16','6/1/16','6/1/16','6/1/16'],
    'AvailabilityZone':['ap-northeast-1a','ap-northeast-1a','ap-northeast-1a','ap-southeast-2','ap-southeast-2'],
    'BlendedCost':[1,2,3,4,5]}
df = pd.DataFrame(raw_data)
print (df)
  AvailabilityZone  BlendedCost ProductName UsageStartDate
0  ap-northeast-1a            1        ASQS         6/1/16
1  ap-northeast-1a            2        AWIS         6/1/16
2  ap-northeast-1a            3        AWIS         6/1/16
3   ap-southeast-2            4        AECC         6/1/16
4   ap-southeast-2            5        AECC         6/1/16

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone ProductName  Sum of products used  Total
0         6/1/16  ap-northeast-1a        ASQS                     1      1
1         6/1/16  ap-northeast-1a        AWIS                     2      5
2         6/1/16   ap-southeast-2        AECC                     2      9

Solution with your sample data:

import pandas as pd
import io

temp=u"""ProductName;AvailabilityZone;UsageStartDate;BlendedCost
Amazon Simple Queue Service;;6/1/16 0:00;0
Alexa Web Information Service;;6/1/16 0:00;0.00347032
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.1
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;4.70E-06
Amazon Elastic Compute Cloud;;6/1/16 0:00;8.00E-08
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.00133333
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.005
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1b;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None)

#print (df)
print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone                   ProductName  \
0    6/1/16 0:00  ap-northeast-1a  Amazon Elastic Compute Cloud   
1    6/1/16 0:00  ap-southeast-1a  Amazon Elastic Compute Cloud   
2    6/1/16 0:00  ap-southeast-1b  Amazon Elastic Compute Cloud   
3    6/1/16 0:00   ap-southeast-2               Amazon DynamoDB   

   Sum of products used  Total  
0                     2   0.12  
1                     2   0.04  
2                     1   0.02  
3                     2   0.00  
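If you only need the "Amazon Elastic Compute Cloud" rows, as asked in the question, a minimal sketch (assuming the df loaded above) filters before grouping; rows with an empty AvailabilityZone are read as NaN and dropped by groupby:

# keep only the EC2 rows, then group by date and zone
ec2 = df[df['ProductName'] == 'Amazon Elastic Compute Cloud']
# count rows and sum cost per (date, zone)
print (ec2.groupby(['UsageStartDate', 'AvailabilityZone'])['BlendedCost']
          .agg(['count', 'sum'])
          .rename(columns={'count': 'Sum of products used', 'sum': 'Total'})
          .reset_index())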

Answer 1 (score: -2)

Here is the documentation on the general aggregation framework for pandas and on the pandas.groupby functionality.

In the future, please read how to ask a great question before asking.

funk.groupby(['AvailabilityZone','UsageStartDate','ProductName'])['BlendedCost'].sum()
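That expression returns a Series with a MultiIndex; a minimal sketch, assuming the same temp.csv from the question, that flattens it back into a regular DataFrame:

import pandas as pd

funk = pd.read_csv('/tmp/temp.csv')
# total cost per zone, date and product, flattened back to ordinary columns
totals = (funk.groupby(['AvailabilityZone', 'UsageStartDate', 'ProductName'])['BlendedCost']
              .sum()
              .reset_index())
print(totals)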