How can I get the most popular item in a group in pandas?

Time: 2019-01-15 18:13:08

Tags: python pandas reshape series

I have a pandas DataFrame containing cars for sale and I'd like to get the most popular model for each brand, but I can't seem to manage it.

The DataFrame has several columns (e.g. vehicle type, price, mileage, year, brand, model) and, for each car brand, I'd like to find which model occurs most often. I've tried a groupby, like this:

popular_models = dataset.groupby('brand').model.value_counts().groupby(level=0).nlargest(1)

But it returns a Pandas Series in which some of the data I want is stored in the indices and it also adds one repeated column that is not making any sense to me.

I'd like to get a DataFrame containing 3 columns, like this:

(https://imgur.com/a/BkKBrv9)

However, I'm getting a pandas series like this:

(https://imgur.com/a/u8CSXY4)

Can someone please help me figure this out?

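3 Answers:

Answer 0 (score: 1)

You have to group by the two columns you want to keep, then count the occurrences of the one you are interested in. Here is a sample input file: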

Brand   Model
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Acura   RDX
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
Beach   Baby
BMW     320i
BMW     320i
BMW     320i
BMW     320i
BMW     320i
BMW     320i
BMW     320i
BMW     550i
BMW     550i
BMW     550i
BMW     550i
BMW     550i
BMW     550i
BMW     550i
Cadillac        Escalade
Cadillac        Escalade
Cadillac        Escalade
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo
Chana   Cargo

A simple pandas one-liner:

import pandas as pd

df = pd.read_table('fun.txt', header=0)
print(df.groupby(['Brand','Model'])['Model'].agg(['count']))

Output:

                   count
Brand    Model
Acura    RDX          10
BMW      320i          7
         550i          7
Beach    Baby         10
Cadillac Escalade      3
Chana    Cargo        12

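If you want to sort the values by frequency (largest to smallest) and keep only the most frequent model for each brand, change the one-liner to: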

groupby_df = (df.groupby(['Brand','Model'])['Model'].agg(['count']).sort_values(by='count', ascending=False).reset_index().drop_duplicates('Brand', keep='first'))

to get:

      Brand     Model  count
0     Chana     Cargo     12
1     Acura       RDX     10
2     Beach      Baby     10
3       BMW      320i      7
5  Cadillac  Escalade      3
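
Note that BMW has two models tied at 7, and which of them survives drop_duplicates depends on how the sort happens to order equal counts. A minimal sketch of one way to make that choice deterministic, assuming an alphabetical tie-break on Model is acceptable; the shortened in-memory sample is a hypothetical stand-in for 'fun.txt', and the names counts and top are illustrative:

```python
import io
import pandas as pd

# Hypothetical stand-in for 'fun.txt': a shortened, tab-separated sample.
rows = (["Acura\tRDX"] * 3 + ["BMW\t320i"] * 2
        + ["BMW\t550i"] * 2 + ["Chana\tCargo"] * 4)
df = pd.read_table(io.StringIO("Brand\tModel\n" + "\n".join(rows)), header=0)

# Same counting step as above, flattened into ordinary columns.
counts = df.groupby(['Brand', 'Model'])['Model'].agg(['count']).reset_index()

# Sort by count (descending), then by Model (ascending) so ties are broken
# alphabetically, and keep the first row seen for each brand.
top = (counts.sort_values(by=['count', 'Model'], ascending=[False, True])
             .drop_duplicates('Brand', keep='first')
             .reset_index(drop=True))
print(top)
```

With this sample, BMW resolves to 320i because it sorts before 550i; a different secondary sort key would give a different but equally repeatable choice.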

Answer 1 (score: 1)

One solution is to compute the counts with a groupby operation, sort by them, and then drop duplicates:

import pandas as pd

df = pd.DataFrame({'Brand': ['B1'] * 5 + ['B2'] * 5,
                   'Model': ['M1', 'M2', 'M1', 'M2', 'M3',
                             'N1', 'N1', 'N2', 'N3', 'N1']})

df['Count'] = df.groupby(['Brand', 'Model'])['Model'].transform('count')

res = df.sort_values('Count', ascending=False)\
        .drop_duplicates('Brand')

print(res)

#   Brand Model  Count
# 5    B2    N1      3
# 0    B1    M1      2

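Note that this drops tied top counts within a group: B1 has both M1 and M2 with a count of 2, but only one of them is kept.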
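
If tied models should be kept rather than dropped, one possible variation is to compare each count with the maximum count in its brand. This is a sketch rather than part of the answer above; the names counts, is_top and res_with_ties are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['B1'] * 5 + ['B2'] * 5,
                   'Model': ['M1', 'M2', 'M1', 'M2', 'M3',
                             'N1', 'N1', 'N2', 'N3', 'N1']})

# One row per (Brand, Model) pair with its frequency.
counts = df.groupby(['Brand', 'Model']).size().reset_index(name='Count')

# True where a row's count equals the largest count within its brand.
is_top = counts['Count'] == counts.groupby('Brand')['Count'].transform('max')

# Keep every top row, so ties within a brand are all retained.
res_with_ties = counts[is_top].reset_index(drop=True)
print(res_with_ties)
```

Here B1 keeps both M1 and M2 (2 each) and B2 keeps N1 (3).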

Answer 2 (score: 0)

Here is one approach.

  1. Set up a DataFrameGroupBy object:

    df.groupby(["Brand", "Model"])

  2. Use the GroupBy size function to compute the size of each subgroup (returned as a Series):

    df.groupby(["Brand", "Model"]).size()

  3. Convert back to a DataFrame while naming the column that holds the values computed by size:

    df.groupby(["Brand", "Model"]).size().reset_index(name="Count")

  4. Sort the DataFrame in descending order of Count, the number of items in each subgroup:

    df.groupby(["Brand", "Model"]).size().reset_index(name="Count").sort_values(by="Count", ascending=False)

  5. Drop rows with duplicate Brand values, keeping the first entry for each brand in the DataFrame.
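
The five steps can be sketched as a single chain. The sample data here is hypothetical, and the drop_duplicates call for step 5 is an assumption based on the step's description, since the code for it does not appear above:

```python
import pandas as pd

# Hypothetical sample data; column names follow the steps above.
df = pd.DataFrame({"Brand": ["Acura", "Acura", "BMW", "BMW", "BMW", "Chana"],
                   "Model": ["RDX", "RDX", "320i", "550i", "550i", "Cargo"]})

most_popular = (
    df.groupby(["Brand", "Model"])                     # step 1: group
      .size()                                          # step 2: subgroup sizes
      .reset_index(name="Count")                       # step 3: back to a DataFrame
      .sort_values(by="Count", ascending=False)        # step 4: largest first
      .drop_duplicates(subset="Brand", keep="first")   # step 5 (assumed call)
)
print(most_popular)
```

Each brand ends up with a single row: the model with the highest Count for that brand.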