Pandas: group and sum a data column whose values are separated by a special character

Time: 2020-06-26 12:41:40

Tags: python pandas

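I have a dataset containing price, date, and cost type. The elements of the "cost type" column are separated by a '-' character. I want to sum the prices and group them into the categories A1, A2, A3, and so on. I have seen several pandas questions and answers on Stack Overflow, but each of them addresses a different, specific problem.

The original DataFrame looks like this: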

 price          date      cost type
+ 14,000    1399/03/02   A11 - A1 -A
+ 5,500     1399/02/25   A31 - A3 -A
+ 67,500    1399/02/22   A21 - A2 -A
+ 10,000    1399/02/20   A11 - A1 -A
+ 8,000     1399/02/19   A12 - A1 -A
+ 5,000     1399/02/19   A31 - A3 -A
+ 8,000     1399/02/15   A12 - A1 -A
+ 5,000     1399/02/12   A32 - A3 -A
+ 14,000    1399/02/10   A13 - A1 -A
+ 5,000     1399/02/09   A31 - A3 -A
+ 2,000     1399/02/08   A33 - A3 -A
+ 27,200    1399/02/03   A11 - A1 -A
+ 66,500    1399/01/31   A21 - A2 -A
+ 10,000    1399/01/20   A11 - A1 -A
+ 10,000    1399/01/18   A12 - A1 -A
+ 10,000    1399/01/18   A11 - A1 -A
+ 8,000     1399/01/06   A12 - A1 -A
+ 9,000     1399/01/04   A11 - A1 -A
+ 20,000    1398/12/28   A14 - A1 -A

I want to group the rows and sum the prices.
The resulting DataFrame should look like this:

CostType(Main)     CostType(Branch)                     Cost
      A                  A1            Sum of all elements (A11, A12, A13, …)
                         A2            Sum of all elements (A21, A22, A23, …)
                         A3            Sum of all elements (A31, A32, A33, …)



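1 answer:

Answer 0 (score: 0)

Use str.split() to split the column you want to break apart, then concatenate the resulting columns with the original DataFrame. Group by the new columns and sum the prices.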

import pandas as pd
import io

data = '''
price date "cost type"
14,000 1399/03/02 "A11 - A1 -A"
5,500 1399/02/25 "A31 - A3 -A"
67,500 1399/02/22 "A21 - A2 -A"
10,000 1399/02/20 "A11 - A1 -A"
8,000 1399/02/19 "A12 - A1 -A"
5,000 1399/02/19 "A31 - A3 -A"
8,000 1399/02/15 "A12 - A1 -A"
5,000 1399/02/12 "A32 - A3 -A"
14,000 1399/02/10 "A13 - A1 -A"
5,000 1399/02/09 "A31 - A3 -A"
2,000 1399/02/08 "A33 - A3 -A"
27,200 1399/02/03 "A11 - A1 -A"
66,500 1399/01/31 "A21 - A2 -A"
10,000 1399/01/20 "A11 - A1 -A"
10,000 1399/01/18  "A12 - A1 -A"
10,000 1399/01/18 "A11 - A1 -A"
8,000 1399/01/06 "A12 - A1 -A"
9,000 1399/01/04 "A11 - A1 -A"
20,000 1398/12/28 "A14 - A1 -A"
'''

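# Parse the whitespace-separated sample; quoted fields keep their inner spaces.
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# Strip the thousands separator so the prices can be summed as integers.
df['price'] = df['price'].str.replace(',', '').astype(int)
# Split on '-' plus any surrounding whitespace so the labels carry no stray spaces.
parts = df['cost type'].str.split(r'\s*-\s*', expand=True)
df2 = pd.concat([df[['price', 'date']], parts], axis=1)
df2.rename(columns={0: 'type_c', 1: 'type_b', 2: 'type_a'}, inplace=True)
# Sum the prices within each (main, branch) category.
print(df2.groupby(['type_a', 'type_b'])['price'].sum().reset_index())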

  type_a type_b   price
0      A     A1  148200
1      A     A2  134000
2      A     A3   22500
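
As a variation on the split-then-group idea in this answer, the following is a minimal sketch that joins the split columns back with `DataFrame.join` and names the result columns after the layout requested in the question. It uses only a five-row subset of the question's data, and the column names `leaf`, `branch`, `main`, and `cost` are hypothetical labels chosen for illustration. It assumes every "cost type" value has exactly three parts separated by '-'.

```python
import pandas as pd

# A five-row subset of the question's data, built inline for illustration.
df = pd.DataFrame({
    "price": ["14,000", "5,500", "67,500", "10,000", "8,000"],
    "date": ["1399/03/02", "1399/02/25", "1399/02/22", "1399/02/20", "1399/02/19"],
    "cost type": ["A11 - A1 -A", "A31 - A3 -A", "A21 - A2 -A",
                  "A11 - A1 -A", "A12 - A1 -A"],
})

# Remove the thousands separator before converting to integers.
df["price"] = df["price"].str.replace(",", "", regex=False).astype(int)

# Split on '-' together with any surrounding whitespace, so the pieces
# come out as 'A11', 'A1', 'A' rather than 'A11 ', ' A1 ', 'A'.
parts = df["cost type"].str.split(r"\s*-\s*", expand=True)
parts.columns = ["leaf", "branch", "main"]  # hypothetical column names

# Attach the split columns by index, then sum prices per (main, branch).
summary = (
    df.join(parts)
      .groupby(["main", "branch"], as_index=False)["price"]
      .sum()
      .rename(columns={"price": "cost"})  # 'cost' is an illustrative name
)
print(summary)
```

Splitting on the pattern `\s*-\s*` rather than a bare '-' keeps stray spaces out of the group keys, which matters if the labels are later compared or merged with other data.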