Question

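My CSV file has 3 columns (CustomerID, Description, UnitPrice). I want to get the most valuable product (the one with the highest UnitPrice) that each customer bought, along with its price.

I downloaded the dataset from here: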

https://archive.ics.uci.edu/ml/datasets/Online%20Retail

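I wrote some code for this. It works, but honestly I don't know why it works, and it looks a bit clumsy. I would like to see all three columns (CustomerID, Description and UnitPrice) in the final result table. Is there a better way to do this?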

import pandas as pd

my_dataFrame = pd.read_csv("OnlineRetailNEW.csv", dtype={'CustomerID': object})

#the most valuable product that each customer bought, and its price
def get_most_valuable_product():

    most_valuable = my_dataFrame.groupby(["CustomerID", "Description"], sort=False)["UnitPrice"].max().reset_index()
    most_valuable = most_valuable.groupby(["CustomerID"]).max().reset_index()
    return most_valuable

print(get_most_valuable_product())

I have also tried this, but the result was not what I wanted:

def get_most_valuable_product():

    most_valuable = my_dataFrame[["CustomerID", "Description", "UnitPrice"]].sort_values('UnitPrice').groupby(['CustomerID']).tail(1)
    return most_valuable

print(get_most_valuable_product())

Answer 1

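my_dataFrame[["CustomerID", "Description", "UnitPrice"]].sort_values('UnitPrice').groupby('CustomerID').tail(1)

If we sort by unit price and then group by customer ID, the most expensive item always ends up at the bottom of each customer's group, so tail(1) returns exactly that row.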
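The sort-then-tail idea can be sketched on a small made-up table. The rows below are hypothetical stand-ins for the real CSV, and the final sort by CustomerID is only there to make the output easier to read.

```python
import pandas as pd

# Hypothetical toy data standing in for the Online Retail CSV.
df = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "B"],
    "Description": ["mug", "lamp", "pen", "clock", "bag"],
    "UnitPrice": [2.5, 12.0, 1.0, 8.0, 3.0],
})

# Sort ascending by price, then keep the last (most expensive) row
# of each customer group. All three columns survive.
top_by_sort = (
    df.sort_values("UnitPrice")
      .groupby("CustomerID")
      .tail(1)
      .sort_values("CustomerID")
      .reset_index(drop=True)
)
print(top_by_sort)
```

If two products tie for a customer's highest price, this keeps only one of them.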

Answer 2

You can use most_valuable.groupby(["CustomerID"]).third_column_name.max(), where third_column_name stands for the column whose maximum you want (here, UnitPrice).
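As a rough illustration on hypothetical toy data, selecting the price column before calling max() returns only the highest price per customer. Calling max() on the whole grouped frame instead takes the maximum of each column independently, so the Description it reports is just the alphabetically last one and may not belong to the highest-priced row.

```python
import pandas as pd

# Hypothetical toy data standing in for the Online Retail CSV.
df = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B"],
    "Description": ["mug", "lamp", "pen", "clock"],
    "UnitPrice": [2.5, 12.0, 1.0, 8.0],
})

# Highest price per customer (a Series indexed by CustomerID).
max_price = df.groupby("CustomerID")["UnitPrice"].max()
print(max_price)

# Column-wise maximum: Description and UnitPrice are maximised separately,
# so customer A gets "mug" paired with 12.0 although the lamp costs 12.0.
columnwise = df.groupby("CustomerID").max()
print(columnwise)
```

This is why a plain max() is not enough when the matching Description is also needed.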

Answer 3

You can use idxmax:

maxids = my_dataFrame.groupby(['CustomerID', 'Description']).UnitPrice.idxmax()

my_dataFrame.loc[maxids.values]
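A minimal sketch of the idxmax approach on hypothetical toy data follows. It groups by CustomerID alone, which is an assumption based on the question's goal of one row per customer; grouping by both columns, as in the answer's snippet, gives one row per customer and product instead. It also assumes the index labels are unique, as they are with the default index from read_csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the Online Retail CSV.
df = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "B"],
    "Description": ["mug", "lamp", "pen", "clock", "bag"],
    "UnitPrice": [2.5, 12.0, 1.0, 8.0, 3.0],
})

# idxmax returns the index label of the highest-priced row in each group.
max_ids = df.groupby("CustomerID")["UnitPrice"].idxmax()

# Use those labels to pull the full rows, keeping all three columns.
top_by_idxmax = df.loc[max_ids]
print(top_by_idxmax)
```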

Note that idxmax gives you only one record per group. If you want all matching records (when several rows tie for the maximum), use transform:

maxvals = my_dataFrame.groupby(['CustomerID', 'Description']).UnitPrice.transform(lambda x: x.max())

my_dataFrame[my_dataFrame.UnitPrice == maxvals]
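The tie-preserving behaviour of transform can be sketched on hypothetical toy data where one customer has two products at the same top price. As an assumption based on the question's goal, the grouping here is by CustomerID alone, and the string "max" is used in place of the lambda.

```python
import pandas as pd

# Hypothetical toy data: customer A has two products tied at 12.0.
df = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "Description": ["mug", "lamp", "vase", "pen"],
    "UnitPrice": [2.5, 12.0, 12.0, 1.0],
})

# transform broadcasts each group's maximum back onto every row,
# so the result has the same length and index as df.
max_per_customer = df.groupby("CustomerID")["UnitPrice"].transform("max")

# Keep every row whose price equals its customer's maximum (ties included).
all_top = df[df["UnitPrice"] == max_per_customer]
print(all_top)
```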

Get the maximum of a third column after using groupby on two columns
