How do I perform a groupby and select unique values in pandas?

Time: 2017-10-19 04:22:06

Tags: python pandas csv dataframe

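My flight dataset contains "UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE" and other attributes, such as passenger counts, that are not relevant to me here. Below is a sample (there are many other airlines, and the dates range up to 2017):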

           UNIQUE_CARRIER_NAME MONTH_YEAR    ROUTE
2512    ATA Airlines d/b/a ATA     2-1990  OGG-HNL
2648    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2649    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2650    ATA Airlines d/b/a ATA     2-1990  IND-RSW
3104    ATA Airlines d/b/a ATA     2-1990  HNL-SFO
3470    ATA Airlines d/b/a ATA     2-1990  SFO-HNL
3482    ATA Airlines d/b/a ATA     2-1990  SFO-OGG
4522    ATA Airlines d/b/a ATA     3-1990  OGG-HNL
5076    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5077    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5078    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5296    ATA Airlines d/b/a ATA     3-1990  RSW-IND
5297    ATA Airlines d/b/a ATA     3-1990  RSW-IND
5371    ATA Airlines d/b/a ATA     3-1990  SFO-HNL
5389    ATA Airlines d/b/a ATA     3-1990  SFO-OGG
....

I want to be able to group by "UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE", in that order, in Python. I wrote this:

carrier_groups = df.groupby(["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"])

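This returns a DataFrameGroupBy object that I can iterate over to run some calculations on the route data. Is there any way to skip aggregating the remaining columns and select only the unique routes within this groupby? These 3 rows should be selected as just 1: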

2648    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2649    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2650    ATA Airlines d/b/a ATA     2-1990  IND-RSW

I want to iterate over the DataFrame grouped by "UNIQUE_CARRIER_NAME" and "MONTH_YEAR", so that I have:

for each group of DataFrame:
    I have a subset of df which I can run a function on ROUTE to get some results

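2 Answers:

Answer 0 (score: 2)

I think you first need drop_duplicates and then apply your function (the one below is only a sample function, since the question gives no information about it):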

def func(x):
    print (x)
    #apply your function 
    #some sample function 
    x['ROUTE'] = x['ROUTE'] + 'a'
    return x 

df = df.drop_duplicates(['UNIQUE_CARRIER_NAME','MONTH_YEAR','ROUTE'])
df = df.apply(func, axis=1)
print (df)
         UNIQUE_CARRIER_NAME MONTH_YEAR     ROUTE
2512  ATA Airlines d/b/a ATA     2-1990  OGG-HNLa
2648  ATA Airlines d/b/a ATA     2-1990  IND-RSWa
3104  ATA Airlines d/b/a ATA     2-1990  HNL-SFOa
3470  ATA Airlines d/b/a ATA     2-1990  SFO-HNLa
3482  ATA Airlines d/b/a ATA     2-1990  SFO-OGGa
4522  ATA Airlines d/b/a ATA     3-1990  OGG-HNLa
5076  ATA Airlines d/b/a ATA     2-1990  RSW-INDa
5296  ATA Airlines d/b/a ATA     3-1990  RSW-INDa
5371  ATA Airlines d/b/a ATA     3-1990  SFO-HNLa
5389  ATA Airlines d/b/a ATA     3-1990  SFO-OGGa
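
To tie this back to the per-group iteration asked about in the question, here is a minimal sketch. It assumes the goal is one subset per ("UNIQUE_CARRIER_NAME", "MONTH_YEAR") pair containing each route only once. The sample frame copies a few rows from the question, and the names `deduped` and `routes_per_group` are made up for illustration:

```python
import pandas as pd

# A few rows copied from the question's sample data.
df = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 6,
    "MONTH_YEAR": ["2-1990", "2-1990", "2-1990", "2-1990", "3-1990", "3-1990"],
    "ROUTE": ["OGG-HNL", "IND-RSW", "IND-RSW", "IND-RSW", "RSW-IND", "RSW-IND"],
})

# Keep the first row of every (carrier, month, route) combination.
deduped = df.drop_duplicates(["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"])

# Iterate over (carrier, month) groups; each group is a plain DataFrame
# whose ROUTE column now holds only distinct routes.
routes_per_group = {}
for (carrier, month_year), group in deduped.groupby(
    ["UNIQUE_CARRIER_NAME", "MONTH_YEAR"]
):
    # Placeholder for the real per-group computation on ROUTE.
    routes_per_group[(carrier, month_year)] = group["ROUTE"].tolist()

print(routes_per_group)
```

Grouping by a list of two columns yields a tuple key for each group, which is why the loop unpacks `(carrier, month_year)`.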

Answer 1 (score: 1)

There is no need for groupby. Just remove the duplicates from the DataFrame with:

df = df.drop_duplicates(subset=['UNIQUE_CARRIER_NAME','MONTH_YEAR','ROUTE'])
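
If only the distinct routes per carrier and month are needed, and not the remaining columns, one possible alternative is to select the ROUTE column from a groupby and call `unique()` on it. This is a sketch on a small made-up subset of the question's rows, not necessarily what either answer intended; the name `unique_routes` is an assumption:

```python
import pandas as pd

# A few rows copied from the question's sample data.
df = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 5,
    "MONTH_YEAR": ["2-1990", "2-1990", "2-1990", "3-1990", "3-1990"],
    "ROUTE": ["IND-RSW", "IND-RSW", "HNL-SFO", "RSW-IND", "RSW-IND"],
})

# One entry per (carrier, month); each value is an array of distinct routes
# in order of first appearance.
unique_routes = df.groupby(["UNIQUE_CARRIER_NAME", "MONTH_YEAR"])["ROUTE"].unique()

for (carrier, month_year), routes in unique_routes.items():
    print(carrier, month_year, list(routes))
```

Unlike drop_duplicates, this discards the other columns, so it only fits when the per-group work depends on ROUTE alone.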