I am working with a CSV in a format like the one below (the full CSV is roughly 500 x ~600,000, so most columns are omitted here):
Sales market_id product_id
0 38 10001516 1132679
1 49 10001516 1138767
2 6 10001516 1132679
... ... ...
9969 245732 1002123 1383020
9970 247093 1006821 1383020
etc.
and I read it in like this:
df0=pd.read_csv('all_final_decomps2_small.csv', low_memory=False, encoding='iso8859_15')
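I am trying to find, for each market_id, the product_id with the highest total sales. To do that I need to sum the sales over rows that share the same product_id, because a market_id can appear in multiple rows.
I tried this, which produces the total for each product in each market: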
df_sales=df0[['Sales','market_id','product_id']]
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()
like so (shortened):
market_id product_id
1006174 1132679 2789
1382460 4586
1382691 49
1383020 269138089
1006638 1132679 5143156
1382460 387250
1383020 204456809
10002899 1132679 630
1382464 220
Using:
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum().max()
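returns only the largest of the sums and nothing else, so in this example it would return 269138089. I would like to get back something like this: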
market_id product_id max_sales
1006174 1383020 269138089
1006638 1383020 204456809
10002899 1132679 630
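I have tried a lot of different things but cannot get anything to work for this example, so I would appreciate any help (and I apologize if this has been asked before).
I am using: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
Answer 0: (score: 2)
Use idxmax within a groupby
Setup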
import pandas as pd
from io import StringIO
txt = """market_id product_id Sales
1006174 1132679 2789
1006174 1382460 4586
1006174 1382691 49
1006174 1383020 269138089
1006638 1132679 5143156
1006638 1382460 387250
1006638 1383020 204456809
10002899 1132679 630
10002899 1382464 220"""
# sep=r'\s+' splits on whitespace; selecting ['Sales'] yields a Series indexed by (market_id, product_id)
sales = pd.read_csv(StringIO(txt), sep=r'\s+', index_col=[0, 1])['Sales']
Solution
sales.loc[sales.groupby(level=0).idxmax()]
market_id product_id
1006174 1383020 269138089
1006638 1383020 204456809
10002899 1132679 630
Name: Sales, dtype: int64
Or
sales.loc[sales.groupby(level=0).idxmax()].reset_index(name='max_sales')
market_id product_id max_sales
0 1006174 1383020 269138089
1 1006638 1383020 204456809
2 10002899 1132679 630
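The setup above starts from sales that are already summed per (market_id, product_id). As a minimal sketch, the same idea can be applied to raw rows like those in the question, where a pair can appear more than once. The sample values and the names `raw`, `totals`, `best_labels` and `result` are made up for illustration and are not part of the original answer:

```python
import pandas as pd

# Hypothetical raw rows: the same (market_id, product_id) pair can repeat.
raw = pd.DataFrame({
    'Sales':      [38, 49, 6, 100, 20, 5],
    'market_id':  [10001516, 10001516, 10001516, 1006174, 1006174, 1006174],
    'product_id': [1132679, 1138767, 1132679, 1383020, 1132679, 1383020],
})

# Step 1: total sales per (market_id, product_id); the result is a Series
# with a two-level MultiIndex.
totals = raw.groupby(['market_id', 'product_id'])['Sales'].sum()

# Step 2: within each market, idxmax gives the (market_id, product_id)
# label of the largest total.
best_labels = totals.groupby(level='market_id').idxmax()

# Step 3: look those labels up and flatten into a regular DataFrame.
result = totals.loc[best_labels.tolist()].reset_index(name='max_sales')
print(result)
```

If two products tie for the top total in a market, idxmax reports only the first one it encounters, so ties may need separate handling.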
Answer 1: (score: 0)
I somehow managed to get this working. I am not sure it is the best approach, but it works on my data:
df0=pd.read_csv('test.csv', low_memory=False, encoding='iso8859_15')
#Rank all items in each market by total sales
df_sales=df0[['Sales', 'market_id', 'product_id']] # int, int, int
# groups sales by market and product and sums product sales
gr_sales = df_sales.groupby(['market_id', 'product_id'], as_index = False).sum()
# gets the product sales in each market and sorts in order of decreasing sales
gr_sales = gr_sales.groupby('market_id').apply(pd.DataFrame.sort_values, 'Sales', ascending = False)
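# Takes the maximum of each column within each market. product_id and Sales are
# maximized separately, so this finds the top seller only if it also has the largest product_id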
max_sales = gr_sales.groupby('market_id').max()
which gives me:
In[621]: max_sales
Out[621]:
market_id product_id Sales
0 1006174 1383020 269138089
1 1006638 1383020 1330070614
2 1006678 1383020 58548417
3 1006684 1383020 215858049
4 1006692 1383020 21799689
5 1006732 1383020 58548417
6 1006733 1383020 58548417
7 1006739 1383020 215858049
8 1006819 1383020 605951504
9 1006820 1383020 59083807
10 1006821 1383020 25116872
11 1050511 1382672 6201692
12 1050512 1382672 5468317
13 10001493 1383020 21799689
14 10001516 1383020 204456809
15 10002899 1383020 62413425
and (shortened example):
In[624]: gr_sales
Out[624]:
market_id product_id Sales
market_id
1006174 11 1006174 1383020 269138089
9 1006174 1382672 5070111
5 1006174 1382536 2442639
7 1006174 1382602 1108361
6 1006174 1382557 158488
8 1006174 1382651 17214
1 1006174 1382460 4586
0 1006174 1132679 2789
3 1006174 1382490 799
2 1006174 1382464 105
10 1006174 1382691 49
4 1006174 1382522 16
1006638 28 1006638 1383020 1330070614
25 1006638 1382672 109679596
12 1006638 1132679 5143156
17 1006638 1382536 4885278
22 1006638 1382620 2668948
21 1006638 1382602 2216722
18 1006638 1382538 992228
13 1006638 1382460 387250
19 1006638 1382557 316976
23 1006638 1382651 39616
26 1006638 1382674 22388
20 1006638 1382573 7412
15 1006638 1382490 1598
14 1006638 1382464 758
24 1006638 1382665 120
27 1006638 1382691 98
16 1006638 1382522 32
1006678 32 1006678 1383020 58548417
... ... ...
[117 rows x 3 columns]
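I do not know how to remove the arbitrary index from the gr_sales output (it sits right in the middle, which is a bit annoying), or from the max_sales table.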
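One possible way to avoid the extra index level is to sort once and keep the first row per market, rather than sorting inside groupby().apply(). This also sidesteps a pitfall: groupby('market_id').max() takes the maximum of every column separately, so the product_id it reports is the largest product_id in the market, not necessarily the one with the largest Sales. The sketch below assumes a frame with market_id, product_id and Sales columns. It rebuilds a tiny gr_sales from totals shown in the question, and the name `top` is made up for illustration:

```python
import pandas as pd

# A few per-product totals copied from the question's shortened output.
gr_sales = pd.DataFrame({
    'market_id':  [1006174, 1006174, 10002899, 10002899],
    'product_id': [1132679, 1383020, 1132679, 1382464],
    'Sales':      [2789, 269138089, 630, 220],
})

# Sort by Sales (largest first), keep the first row seen for each market,
# then restore market order and replace the old row labels with 0..n-1.
top = (gr_sales.sort_values('Sales', ascending=False)
               .drop_duplicates('market_id')
               .sort_values('market_id')
               .reset_index(drop=True))
print(top)

# For comparison: the column-wise max pairs market 10002899 with
# product 1382464, although product 1132679 has the larger Sales.
print(gr_sales.groupby('market_id').max())
```

Because each kept row is an original row of gr_sales, product_id and Sales stay paired, and reset_index(drop=True) leaves a plain 0..n-1 index.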