Question

I have a CSV in a format like the one below (the full CSV is roughly 500 x ~600,000, so most columns are omitted here):

       Sales  market_id  product_id

0         38   10001516     1132679
1         49   10001516     1138767
2          6   10001516     1132679
     ...        ...         ...
9969  245732    1002123     1383020
9970  247093    1006821     1383020

and so on, read in as follows: df0=pd.read_csv('all_final_decomps2_small.csv', low_memory=False, encoding='iso8859_15')

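I am trying to find the product_id with the highest sales for each market_id. To do that, I first need to sum the sales over rows that share the same market_id and product_id, since a given pair can appear on multiple rows.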

I tried this, which produces the total for each product in each market:

df_sales=df0[['Sales','market_id','product_id']] 
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()

like so (shortened):

market_id  product_id
1006174    1132679             2789
           1382460             4586
           1382691               49
           1383020        269138089
1006638    1132679          5143156
           1382460           387250
           1383020        204456809
10002899   1132679              630
           1382464              220

Using:

df_sales.groupby(['market_id', 'product_id'])['Sales'].sum().max()

returns only the single largest sum and nothing else, so in this example it would return 269138089. I would like to return something like this:

market_id  product_id      max_sales
1006174    1383020        269138089
1006638    1383020        204456809
10002899   1132679              630

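I have tried many different things but can't get anything to work for this example, so I would appreciate any help (apologies if this has been asked before).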

I am using: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)

Answer 1

Use groupby with idxmax

Setup

import pandas as pd
from io import StringIO

txt = """market_id  product_id         Sales
1006174    1132679             2789
1006174    1382460             4586
1006174    1382691               49
1006174    1383020        269138089
1006638    1132679          5143156
1006638    1382460           387250
1006638    1383020        204456809
10002899   1132679              630
10002899   1382464              220"""


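# sep=r'\s+' splits on whitespace; selecting ['Sales'] returns a Series
# (the squeeze= argument was removed in pandas 2.0)
sales = pd.read_csv(StringIO(txt), sep=r'\s+', index_col=[0, 1])['Sales']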

Solution

sales.loc[sales.groupby(level=0).idxmax()]

market_id  product_id
1006174    1383020       269138089
1006638    1383020       204456809
10002899   1132679             630
Name: Sales, dtype: int64

Or

sales.loc[sales.groupby(level=0).idxmax()].reset_index(name='max_sales')

   market_id  product_id  max_sales
0    1006174     1383020  269138089
1    1006638     1383020  204456809
2   10002899     1132679        630
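
The setup above starts from totals that are already summed. As a rough sketch of the full path from raw rows (where the same market/product pair repeats) to the final table, something like the following should work. The `raw` frame and its values are made up for illustration, and the names `totals`, `best`, and `result` are just placeholders.

```python
import pandas as pd

# Hypothetical raw rows: the same (market_id, product_id) pair can repeat.
raw = pd.DataFrame({
    "Sales":      [38, 49, 6, 100, 5, 7],
    "market_id":  [10001516, 10001516, 10001516, 1002123, 1002123, 1002123],
    "product_id": [1132679, 1138767, 1132679, 1383020, 1132679, 1383020],
})

# Step 1: total sales per (market_id, product_id) pair.
totals = raw.groupby(["market_id", "product_id"])["Sales"].sum()

# Step 2: per market, idxmax gives the full index label of the largest
# total, i.e. a (market_id, product_id) tuple.
best = totals.groupby(level="market_id").idxmax()

# Step 3: look those labels up and flatten into an ordinary table.
result = totals.loc[best].reset_index(name="max_sales")
print(result)
```

If two products tie for the top total in a market, idxmax keeps the first one it encounters, so only one row per market comes back.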

Answer 2

Somehow managed to get this. I'm not sure it's the best approach, but it works for my data:

df0=pd.read_csv('test.csv', low_memory=False, encoding='iso8859_15')

#Rank all items in each market by total sales
df_sales=df0[['Sales', 'market_id', 'product_id']] # int, int, int

# groups sales by market and product and sums product sales
gr_sales = df_sales.groupby(['market_id', 'product_id'], as_index = False).sum()

# gets the product sales in each market and sorts in order of decreasing sales
gr_sales = gr_sales.groupby('market_id').apply(pd.DataFrame.sort_values, 'Sales', ascending = False)

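# Finds the product id with the maximum sales in each market
# Caveat: .max() is computed per column, so product_id is the largest id in
# the market; it matches the top seller only when that product also has the
# largest id
max_sales = gr_sales.groupby('market_id').max()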

which gives me:

In[621]: max_sales
Out[621]: 
    market_id  product_id       Sales
0     1006174     1383020   269138089
1     1006638     1383020  1330070614
2     1006678     1383020    58548417
3     1006684     1383020   215858049
4     1006692     1383020    21799689
5     1006732     1383020    58548417
6     1006733     1383020    58548417
7     1006739     1383020   215858049
8     1006819     1383020   605951504
9     1006820     1383020    59083807
10    1006821     1383020    25116872
11    1050511     1382672     6201692
12    1050512     1382672     5468317
13   10001493     1383020    21799689
14   10001516     1383020   204456809
15   10002899     1383020    62413425

and (shortened example):

In[624]: gr_sales
Out[624]: 
               market_id  product_id       Sales
market_id                                       
1006174   11     1006174     1383020   269138089
          9      1006174     1382672     5070111
          5      1006174     1382536     2442639
          7      1006174     1382602     1108361
          6      1006174     1382557      158488
          8      1006174     1382651       17214
          1      1006174     1382460        4586
          0      1006174     1132679        2789
          3      1006174     1382490         799
          2      1006174     1382464         105
          10     1006174     1382691          49
          4      1006174     1382522          16
1006638   28     1006638     1383020  1330070614
          25     1006638     1382672   109679596
          12     1006638     1132679     5143156
          17     1006638     1382536     4885278
          22     1006638     1382620     2668948
          21     1006638     1382602     2216722
          18     1006638     1382538      992228
          13     1006638     1382460      387250
          19     1006638     1382557      316976
          23     1006638     1382651       39616
          26     1006638     1382674       22388
          20     1006638     1382573        7412
          15     1006638     1382490        1598
          14     1006638     1382464         758
          24     1006638     1382665         120
          27     1006638     1382691          98
          16     1006638     1382522          32
1006678   32     1006678     1383020    58548417
                 ...         ...         ...

[117 rows x 3 columns]

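I don't know how to remove the arbitrary index from the gr_sales output (it sits right in the middle, which is a bit annoying), or from the max_sales table.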
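
One possible way to get rid of that leftover index is to sort the whole frame once instead of sorting inside groupby().apply(), then call reset_index(drop=True). The sketch below uses made-up numbers (the market ids 1 and 2 are placeholders, and `ranked`, `top`, and `colwise` are illustrative names). It also shows why picking whole rows is safer than a per-column max, which can pair a product_id with a Sales value from a different row.

```python
import pandas as pd

# Hypothetical totals, already summed per (market_id, product_id).
gr_sales = pd.DataFrame({
    "market_id":  [1, 1, 2, 2],
    "product_id": [1132679, 1383020, 1132679, 1383020],
    "Sales":      [500, 10, 20, 300],
})

# Per-column max: for market 1 this reports product_id 1383020 with
# Sales 500, a combination that does not exist in the data.
colwise = gr_sales.groupby("market_id").max()

# Ranking without the nested index: sort by market, then by descending
# sales, and replace the old row labels with 0..n-1.
ranked = (
    gr_sales.sort_values(["market_id", "Sales"], ascending=[True, False])
            .reset_index(drop=True)
)

# Top seller per market: the first row of each market in the ranking.
top = ranked.drop_duplicates("market_id").reset_index(drop=True)
print(top)
```

Here `ranked` plays the role of gr_sales and `top` the role of max_sales, both with a plain 0..n-1 index.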

Getting the maximum value of each group when using groupby() on two columns ~ Python

2 answers: