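My flight dataset contains "UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE", and other attributes such as passenger counts, which are irrelevant here. Below is a sample (there are many other airlines, and the dates range up to 2017):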
UNIQUE_CARRIER_NAME MONTH_YEAR ROUTE
2512 ATA Airlines d/b/a ATA 2-1990 OGG-HNL
2648 ATA Airlines d/b/a ATA 2-1990 IND-RSW
2649 ATA Airlines d/b/a ATA 2-1990 IND-RSW
2650 ATA Airlines d/b/a ATA 2-1990 IND-RSW
3104 ATA Airlines d/b/a ATA 2-1990 HNL-SFO
3470 ATA Airlines d/b/a ATA 2-1990 SFO-HNL
3482 ATA Airlines d/b/a ATA 2-1990 SFO-OGG
4522 ATA Airlines d/b/a ATA 3-1990 OGG-HNL
5076 ATA Airlines d/b/a ATA 2-1990 RSW-IND
5077 ATA Airlines d/b/a ATA 2-1990 RSW-IND
5078 ATA Airlines d/b/a ATA 2-1990 RSW-IND
5296 ATA Airlines d/b/a ATA 3-1990 RSW-IND
5297 ATA Airlines d/b/a ATA 3-1990 RSW-IND
5371 ATA Airlines d/b/a ATA 3-1990 SFO-HNL
5389 ATA Airlines d/b/a ATA 3-1990 SFO-OGG
....
I want to be able to group by "UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE", in that order, in Python. I wrote this:
carrier_groups = df.groupby(["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"])
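This returns a DataFrameGroupBy object that I can iterate over to run some calculations on the route data. Is there any way to skip aggregating the data (for the remaining columns) and select only the unique routes from this groupby? These 3 rows should be selected as just 1: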
2648 ATA Airlines d/b/a ATA 2-1990 IND-RSW
2649 ATA Airlines d/b/a ATA 2-1990 IND-RSW
2650 ATA Airlines d/b/a ATA 2-1990 IND-RSW
I want to iterate over this DataFrame grouped by "UNIQUE_CARRIER_NAME", "MONTH_YEAR", so that I have:
for each group of DataFrame:
    I have a subset of df on which I can run a function on ROUTE to get some results
Answer 0 (score: 2)
I think you first need drop_duplicates and then apply with your function (this is only a sample function, because there is no information about the real one):
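def func(x):
    # x is a single row (a Series) because apply is called with axis=1
    print(x)
    # apply your function here; this is only a sample function
    x['ROUTE'] = x['ROUTE'] + 'a'
    return x

# keep one row per (carrier, month-year, route) combination
df = df.drop_duplicates(['UNIQUE_CARRIER_NAME', 'MONTH_YEAR', 'ROUTE'])
df = df.apply(func, axis=1)
print(df)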
UNIQUE_CARRIER_NAME MONTH_YEAR ROUTE
2512 ATA Airlines d/b/a ATA 2-1990 OGG-HNLa
2648 ATA Airlines d/b/a ATA 2-1990 IND-RSWa
3104 ATA Airlines d/b/a ATA 2-1990 HNL-SFOa
3470 ATA Airlines d/b/a ATA 2-1990 SFO-HNLa
3482 ATA Airlines d/b/a ATA 2-1990 SFO-OGGa
4522 ATA Airlines d/b/a ATA 3-1990 OGG-HNLa
5076 ATA Airlines d/b/a ATA 2-1990 RSW-INDa
5296 ATA Airlines d/b/a ATA 3-1990 RSW-INDa
5371 ATA Airlines d/b/a ATA 3-1990 SFO-HNLa
5389 ATA Airlines d/b/a ATA 3-1990 SFO-OGGa
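If the goal is to run a function once per ("UNIQUE_CARRIER_NAME", "MONTH_YEAR") group rather than once per row, one possible approach is to drop the duplicates first and then iterate over a groupby on those two columns. The following is only a sketch: the sample frame is a hand-built subset of the rows in the question, and `summarize_routes` is a hypothetical placeholder for whatever calculation is actually needed.

```python
import pandas as pd

# Hand-built sample mirroring a few rows from the question (assumption)
df = pd.DataFrame(
    {
        "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 6,
        "MONTH_YEAR": ["2-1990", "2-1990", "2-1990", "2-1990", "3-1990", "3-1990"],
        "ROUTE": ["OGG-HNL", "IND-RSW", "IND-RSW", "IND-RSW", "OGG-HNL", "RSW-IND"],
    },
    index=[2512, 2648, 2649, 2650, 4522, 5296],
)

keys = ["UNIQUE_CARRIER_NAME", "MONTH_YEAR"]

# Keep one row per (carrier, month-year, route) combination
unique_routes = df.drop_duplicates(subset=keys + ["ROUTE"])


def summarize_routes(routes):
    """Hypothetical per-group calculation: count the routes in the group."""
    return len(routes)


# Each iteration yields the group key (a tuple) and the matching sub-DataFrame
results = {}
for (carrier, month_year), group in unique_routes.groupby(keys, sort=False):
    results[(carrier, month_year)] = summarize_routes(group["ROUTE"])

print(results)
```

Because the duplicates are removed before grouping, each `group` contains every route at most once, so the function never sees the repeated IND-RSW rows.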
Answer 1 (score: 1)
No grouping is needed. Just drop the duplicates from the DataFrame with:
df = df.drop_duplicates(subset=['UNIQUE_CARRIER_NAME','MONTH_YEAR','ROUTE'])
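To illustrate what happens to the remaining columns, here is a small sketch. The "PASSENGERS" column and its values are made up for illustration and do not come from the question. By default drop_duplicates keeps the first row of each duplicate set unchanged, so nothing is aggregated; passing keep="last" keeps the last row instead.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 4,
        "MONTH_YEAR": ["2-1990"] * 4,
        "ROUTE": ["IND-RSW", "IND-RSW", "IND-RSW", "RSW-IND"],
        "PASSENGERS": [10, 20, 30, 40],  # hypothetical extra column
    },
    index=[2648, 2649, 2650, 5076],
)

subset = ["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"]

# Default keep="first": the first row of each duplicate set survives as-is
deduped = df.drop_duplicates(subset=subset)

# keep="last": the last row of each duplicate set survives instead
deduped_last = df.drop_duplicates(subset=subset, keep="last")

print(deduped)
print(deduped_last)
```

The original index labels are preserved in both results; chaining `.reset_index(drop=True)` would renumber the rows if that is preferred.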