Pandas: how to group by and remove specific rows

Time: 2018-11-23 19:23:30

Tags: python pandas

I have a DataFrame like this:

id     company     duration
0    Other Company    5
0    Other Company    19
0    X Company        7
1    Other Company    24
1    Other Company    6
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16

I need to group the DataFrame by id and company and then sum the durations. In the end, I only need the values for "X Company". This is what I did:

import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)

This gave me:

id     company     duration
0    Other Company    24
0    X Company        7
1    Other Company    30
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16

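Now I need to remove all the "Other Company" entries. I already tried time_in_company.drop('Any Company'), which raises KeyError: 'Any Company'.

I tried .set_index('company') so that I could attempt something else, but it tells me that the 'Series' object has no attribute 'set_index'.

I tried using .filter() on the groupby, but I need .agg(sum) (and it did not work anyway).

Can someone help me figure this out? Thanks in advance.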

2 answers:

Answer 0: (score: 1)

Does this help?

time_in_company = time_in_company.reset_index(level='company')
time_in_company[time_in_company['company'] != "Other Company"]
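
For anyone who wants to run this end to end, here is a minimal sketch. It rebuilds the sample data inline rather than reading data/jobs.csv (an assumption made only so the snippet is self-contained), and the names flat, x_only and x_only_series are illustrative. The groupby result is a Series with an (id, company) MultiIndex, which is likely why .set_index is unavailable and why a plain .drop(label) raises KeyError: without a level argument the label is looked up in the first index level (id).

```python
import pandas as pd

# Sample data from the question, built inline (assumption: stands in for jobs.csv).
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Series indexed by an (id, company) MultiIndex.
time_in_company = jobs.groupby(["id", "company"])["duration"].sum()

# Approach from this answer: move 'company' out of the index into a column,
# which yields a DataFrame indexed by id, then filter with a boolean mask.
flat = time_in_company.reset_index(level="company")
x_only = flat[flat["company"] != "Other Company"]

# Alternative that keeps the Series: drop the label from the 'company' level.
x_only_series = time_in_company.drop("Other Company", level="company")

print(x_only)
print(x_only_series)
```

Both variants leave only the "X Company" totals; the first returns a DataFrame, the second keeps the MultiIndex Series.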

Answer 1: (score: 0)

First use DataFrame.query() to filter out the "X Company" rows, then group what remains of the df, for example:

import numpy as np
import pandas as pd


ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]

df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})


df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)

You get:

ids  company      
0    Other Company    24
1    Other Company    30
3    Other Company    30
Name: duration, dtype: int64
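
The question asks for the "X Company" totals, so the same filter-first idea can be pointed at that label instead. The following is only a sketch under that reading of the question; the variable name x_totals is illustrative, and grouping by ids alone is a simplification that works because only one company is left after the filter.

```python
import pandas as pd

# Same sample data as in this answer.
df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Keep only the "X Company" rows first, then sum the durations per id.
x_totals = (
    df.query("company == 'X Company'")
      .groupby("ids")["duration"]
      .sum()
)

print(x_totals)
```

Filtering before grouping means the aggregation only touches the rows that are actually needed.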

Edit: alternatively, you can use a combination of DataFrame.where(), dropna() and pivot_table():

df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')

You get:

                   duration
ids company                
0.0 Other Company      24.0
1.0 Other Company      30.0
3.0 Other Company      30.0
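
The ids and totals show up as floats (0.0, 24.0) because where() replaces non-matching rows with NaN, which forces the integer columns to float before dropna() removes those rows. A hedged sketch of a variant that selects rows with a boolean mask instead, which should keep the integer dtypes; the names mask, via_where and totals are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

mask = df["company"] == "Other Company"

# where() + dropna(): the intermediate NaN rows turn integer columns into floats.
via_where = df.where(mask).dropna()

# Boolean indexing never creates NaN rows, so the selected columns stay integers.
totals = df[mask].pivot_table(values="duration",
                              index=["ids", "company"],
                              aggfunc="sum")

print(via_where.dtypes)
print(totals)
```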

Still, the first one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
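
The figures above look like IPython %timeit output. Outside IPython, a rough comparison can be sketched with the standard-library timeit module; the function names with_query and with_where and the repeat count of 50 are arbitrary choices for illustration, and the absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})


def with_query():
    # First approach: query() followed by groupby and sum.
    return (df.query("company == 'Other Company'")
              .groupby(["ids", "company"])["duration"]
              .sum())


def with_where():
    # Second approach: where() + dropna() + pivot_table().
    return (df.where(df["company"] == "Other Company")
              .dropna()
              .pivot_table(["duration"], index=["ids", "company"], aggfunc="sum"))


# Total seconds for 50 calls of each approach.
query_seconds = timeit.timeit(with_query, number=50)
where_seconds = timeit.timeit(with_where, number=50)

print(f"query + groupby:     {query_seconds / 50 * 1000:.3f} ms per call")
print(f"where + pivot_table: {where_seconds / 50 * 1000:.3f} ms per call")
```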