I have a DataFrame like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DataFrame by id and company and then sum the durations. In the end I only need the values for "X Company". This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
This gave me:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
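Now I need to remove all the "Other Company" entries. I already tried time_in_company.drop('Any Company'), which raises KeyError: 'Any Company'.
I tried .set_index('company') so that I could try something else, but it tells me that the 'Series' object has no attribute 'set_index'.
I tried using .filter() on the groupby, but I need .agg(sum) (and it did not work anyway).
Can anyone help me figure this out? Thanks in advance.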
Answer 0 (score: 1)
Does this help?
time_in_company = time_in_company.reset_index(level='company')
time_in_company[time_in_company['company'] != "Other Company"]
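For anyone who wants to run this end to end, here is a minimal sketch of the same idea. It assumes the sample data from the question is built inline instead of being read from data/jobs.csv, and the names flat, x_only and dropped are only illustrative. It also shows one possible alternative, passing level= to drop(), which may be why the plain drop() call in the question raised a KeyError.

```python
import pandas as pd

# Sample data copied from the question (assumed to match data/jobs.csv).
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Group and sum; the result is a Series indexed by (id, company).
time_in_company = jobs.groupby(["id", "company"])["duration"].sum()

# Approach from this answer: move 'company' out of the index into a column,
# then keep the rows whose company is not "Other Company".
flat = time_in_company.reset_index(level="company")
x_only = flat[flat["company"] != "Other Company"]
print(x_only)

# Alternative sketch: drop the label on the 'company' index level directly.
dropped = time_in_company.drop("Other Company", level="company")
print(dropped)
```

The first result is a DataFrame indexed by id with company and duration columns; the second stays a Series with the original two-level index.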
Answer 1 (score: 0)
First use DataFrame.query() to drop the "X Company" rows, then group the rest of the DataFrame, for example:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
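The example above keeps the "Other Company" rows. Since the question asks for the "X Company" totals, a sketch of the same query-then-group approach with the condition flipped might look like this (the name x_totals is only illustrative):

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Filter first so that only "X Company" rows are grouped and summed.
x_totals = (
    df.query("company == 'X Company'")
      .groupby(["ids", "company"])["duration"]
      .sum()
)
print(x_totals)
```

Filtering before grouping means the rows for other companies are never aggregated at all.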
Edit: alternatively, you can combine DataFrame.where(), dropna() and pivot_table():
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
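The ids and durations come back as floats here because where() replaces the non-matching rows with NaN before dropna() removes them, which upcasts the integer columns. A possible variant, sketched below under the assumption that a plain boolean mask is acceptable, skips the NaN step; it selects "X Company" as the question asks, and the names filtered and table are only illustrative.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Boolean indexing removes rows outright, so no NaN values are introduced.
filtered = df[df["company"] == "X Company"]

# Sum the durations per (ids, company) pair.
table = filtered.pivot_table(values="duration",
                             index=["ids", "company"],
                             aggfunc="sum")
print(table)
```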
Still, the first one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)