I am trying to find the month (Month column) with the largest total in the DepDelay column.
Data:
flightID Month ArrTime ActualElapsedTime DepDelay ArrDelay
BBYYEUVY67527 1 1514.0 58.0 NA 64.0
MUPXAQFN40227 1 37.0 120.0 13 52.0
LQLYUIMN79169 1 916.0 166.0 NA -25.0
KTAMHIFO10843 1 NaN NaN 5 NaN
BOOXJTEY23623 1 NaN NaN 4 NaN
BBYYEUVY67527 2 1514.0 58.0 NA 64.0
MUPXAQFN40227 2 37.0 120.0 NA 52.0
LQLYUIMN79169 2 916.0 166.0 NA -25.0
KTAMHIFO10843 2 NaN NaN 15 NaN
BOOXJTEY23623 2 NaN NaN 4 NaN
I tried:
all_data = pd.read_csv('data.csv', sep='\t')
dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
print(dep_delay)
Error:
AttributeError Traceback (most recent call last)
<ipython-input-14-2ea6213009d6> in <module>()
----> 1 dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
2
3 print(dep_delay)
AttributeError: 'list' object has no attribute 'DepDelay'
Desired output:
Month DepDelay
1 22
Answer 0 (score: 5)
You need sum rather than count to add up the values within each group. Here is one way, using GroupBy + sum followed by idxmax:
res = df.groupby('Month')['DepDelay'].sum().reset_index()
res = res.loc[[res['DepDelay'].idxmax()]]
print(res)
Month DepDelay
0 1 22.0
Alternatively, you can group and sort, then extract the first row:
res = df.groupby('Month')['DepDelay'].sum()\
.sort_values(ascending=False).head(1)\
.reset_index()
print(res)
Month DepDelay
0 1 22.0
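For anyone who wants to reproduce this without data.csv, here is a minimal self-contained sketch. It assumes the sample rows from the question can be read as whitespace-separated text (the real file is tab-separated), and the names RAW, totals, best_month and best_total are only illustrative. It also calls idxmax directly on the grouped Series, which returns the month label without needing reset_index.

```python
import io

import pandas as pd

# Sample rows copied from the question. Assumption: whitespace-separated here
# for convenience; "NA" and "NaN" are both parsed as missing values by default.
RAW = """flightID Month ArrTime ActualElapsedTime DepDelay ArrDelay
BBYYEUVY67527 1 1514.0 58.0 NA 64.0
MUPXAQFN40227 1 37.0 120.0 13 52.0
LQLYUIMN79169 1 916.0 166.0 NA -25.0
KTAMHIFO10843 1 NaN NaN 5 NaN
BOOXJTEY23623 1 NaN NaN 4 NaN
BBYYEUVY67527 2 1514.0 58.0 NA 64.0
MUPXAQFN40227 2 37.0 120.0 NA 52.0
LQLYUIMN79169 2 916.0 166.0 NA -25.0
KTAMHIFO10843 2 NaN NaN 15 NaN
BOOXJTEY23623 2 NaN NaN 4 NaN
"""

df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")

# Total DepDelay per month; sum() skips missing values.
totals = df.groupby("Month")["DepDelay"].sum()

best_month = totals.idxmax()  # index label (the month) of the largest total
best_total = totals.max()     # the largest total itself
print(best_month, best_total)  # 1 22.0
```

Month 1 totals 13 + 5 + 4 = 22 and month 2 totals 15 + 4 = 19, so month 1 wins, which matches the desired output in the question.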
Answer 1 (score: 2)
To make your code run, change
dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
to
dep_delay = all_data.groupby(["Month"]).DepDelay.count().max()
To find your solution (this selects the row with the largest single DepDelay value):
idx = all_data['DepDelay'].idxmax()
all_data.loc[[idx], ['Month', 'DepDelay']]
Output:
Month DepDelay
8 2 15.0
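It may help to see the three computations side by side, since they answer different questions. This is only a sketch: the frame below keeps just the Month and DepDelay values from the question's sample, and the names counts, sums and idx are illustrative.

```python
import pandas as pd

NAN = float("nan")

# Month and DepDelay values copied from the question's sample data.
all_data = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "DepDelay": [NAN, 13, NAN, 5, 4, NAN, NAN, NAN, 15, 4],
})

# Corrected version of the original line: count() counts the non-missing
# DepDelay entries per month, and max() returns the largest of those counts.
counts = all_data.groupby(["Month"]).DepDelay.count()
print(counts.max())  # 3

# Per-month totals, which is what the desired output (Month 1, 22) reflects.
sums = all_data.groupby(["Month"]).DepDelay.sum()
print(sums.idxmax(), sums.max())  # 1 22.0

# Row holding the largest single DepDelay value (index 8, Month 2).
idx = all_data["DepDelay"].idxmax()
print(all_data.loc[[idx], ["Month", "DepDelay"]])
```

So the corrected count().max() line runs, but it returns 3 (the number of non-missing delays in month 1), and the idxmax on the raw column picks the single largest delay rather than the month with the largest total.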
Answer 2 (score: 2)
Another approach:
pd.DataFrame(df.loc[df['DepDelay'].idxmax(), ['Month', 'DepDelay']]).T
# Month DepDelay
#8 2 15
You can reset the index to change 8 to 0:
pd.DataFrame(df.loc[df['DepDelay'].idxmax(), ['Month', 'DepDelay']]).T.reset_index(drop=True)
# Month DepDelay
#0 2 15