如何用python pandas中的一些编码计算一列的总和?

时间:2017-09-12 07:19:00

标签: python pandas dataframe

我有一个pandas Dataframe,如下所示:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','cricket','cricket','cricket','cricket'],
       'person_name': ['ramesh','ramesh','ramesh','ramesh','mohit','mahesh','mahesh','mahesh','mahesh'],
           'city': ['mumbai', 'mumbai','delhi','delhi','pune','surat','surat','panji','panji'],
    'person_symbol': ['ram','mum','mum','ram','moh','mah','sur','sur','mah'],
    'person_count': ['10','14','25','20','11','34','23','43','34']})
df = df[['sport_name','person_name','city','person_symbol','person_count']]
print df

  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20
4   football       mohit    pune           moh           11
5    cricket      mahesh   surat           mah           34
6    cricket      mahesh   surat           sur           23
7    cricket      mahesh   panji           sur           43
8    cricket      mahesh   panji           mah           34

从这个Dataframe中,我想创建两个名为" derived_symbol "的列数据框。和" person_count "。为了创建它,我需要关注以下某些条件:

  • derived_symbol需要为每个独特的城市和person_symbol形成。
  • person_count是根据derived_symbol的内容计算的。

示例:

考虑第一组Dataframe,即sport_name = football和person_name = rakesh这是四行:

  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20

在上面四行中,如果我们查看city和person_symbol列值,我们可以看到四个不同的唯一值,分别是mumbai,delhi,ram和mum。所以对于以上四行,我们的输出如下所示:

derived_symbol                     person_count
football.ramesh.TOTAL.mumbai_count  24
football.ramesh.TOTAL.delhi_count   45
football.ramesh.TOTAL.ram_count     30
football.ramesh.TOTAL.mum_count     39

最终预期结果:

derived_symbol                      person_count
football.ramesh.TOTAL.mumbai_count  24
football.ramesh.TOTAL.delhi_count   45
football.ramesh.TOTAL.ram_count     30
football.ramesh.TOTAL.mum_count     39
football.mohit.TOTAL.pune_count     11
football.mohit.TOTAL.moh_count      11
cricket.mahesh.TOTAL.surat_count    57
cricket.mahesh.TOTAL.panji_count    77
cricket.mahesh.TOTAL.sur_count      66
cricket.mahesh.TOTAL.mah_count      68

修改:Dataframe的日期如下:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
       'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
           'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
    'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
    'person_count': ['10','14','25','20','34','23','43','34'],
    'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]
print df

  sport_name person_name    city person_symbol person_count       month
0   football      ramesh  mumbai           ram           10  2017-01-23
1   football      ramesh  mumbai           mum           14  2017-01-23
2   football      ramesh   delhi           mum           25  2017-01-23
3   football      ramesh   delhi           ram           20  2017-01-23
4   football      ramesh  mumbai           ram           34  2017-02-26
5   football      ramesh  mumbai           mum           23  2017-02-26
6   football      ramesh   delhi           mum           43  2017-02-26
7   football      ramesh   delhi           ram           34  2017-02-26

预期输出:

derived_symbol              person_count    month
football.ramesh.TOTAL.mumbai_count  24      2017-01-23
football.ramesh.TOTAL.delhi_count   45      2017-01-23
football.ramesh.TOTAL.ram_count     30      2017-01-23
football.ramesh.TOTAL.mum_count     39      2017-01-23
football.ramesh.TOTAL.mumbai_count  57      2017-02-26
football.ramesh.TOTAL.delhi_count   77      2017-02-26
football.ramesh.TOTAL.ram_count     68      2017-02-26
football.ramesh.TOTAL.mum_count     66      2017-02-26

我按照以下方式计算了person_count月份:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
        'person_count': ['10','14','25','20','34','23','43','34'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]

df['person_count'] = df['person_count'].astype(int)

df1=df.set_index(['sport_name','person_name','person_count','month']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month'])['person_count'].sum().reset_index(name='person_count')
print (df2)

2 个答案:

答案 0 :(得分:2)

这是单程

首先将person_count类型更改为数字

In [2126]: df.person_count = df.person_count.astype(int)

重塑您的数据,使cityperson_symbol置于一个级别,然后groupby获取总计数。

In [2127]: dff = (df.melt(id_vars=['sport_name', 'person_name', 'person_count'])
                    .groupby(['sport_name', 'person_name', 'value']).person_count.sum())

In [2128]: dff
Out[2128]:
sport_name  person_name  value
cricket     mahesh       mah       68
                         panji     77
                         sur       66
                         surat     57
football    mohit        moh       11
                         pune      11
            ramesh       delhi     45
                         mum       39
                         mumbai    24
                         ram       30
Name: person_count, dtype: int32

format自定义格式的索引级别。

In [2129]: dff.index = ['{0}.{1}.TOTAL.{2}_count'.format(*idx) for idx in dff.index]

In [2130]: dff
Out[2130]:
cricket.mahesh.TOTAL.mah_count        68
cricket.mahesh.TOTAL.panji_count      77
cricket.mahesh.TOTAL.sur_count        66
cricket.mahesh.TOTAL.surat_count      57
football.mohit.TOTAL.moh_count        11
football.mohit.TOTAL.pune_count       11
football.ramesh.TOTAL.delhi_count     45
football.ramesh.TOTAL.mum_count       39
football.ramesh.TOTAL.mumbai_count    24
football.ramesh.TOTAL.ram_count       30
Name: person_count, dtype: int32

答案 1 :(得分:2)

使用:

#convert column to int
df['person_count'] = df['person_count'].astype(int)

#reshape for one column from city and person_symbol columns 
df1=df.set_index(['sport_name','person_name','person_count']).stack().reset_index(name='val')
print (df1)
   sport_name person_name  person_count        level_3     val
0    football      ramesh            10           city  mumbai
1    football      ramesh            10  person_symbol     ram
2    football      ramesh            14           city  mumbai
3    football      ramesh            14  person_symbol     mum
4    football      ramesh            25           city   delhi
5    football      ramesh            25  person_symbol     mum
6    football      ramesh            20           city   delhi
7    football      ramesh            20  person_symbol     ram
8    football       mohit            11           city    pune
9    football       mohit            11  person_symbol     moh
10    cricket      mahesh            34           city   surat
11    cricket      mahesh            34  person_symbol     mah
12    cricket      mahesh            23           city   surat
13    cricket      mahesh            23  person_symbol     sur
14    cricket      mahesh            43           city   panji
15    cricket      mahesh            43  person_symbol     sur
16    cricket      mahesh            34           city   panji
17    cricket      mahesh            34  person_symbol     mah
#concatenate columns
a = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

#groupby by Series a and aggregate sum
df2 = df1['person_count'].groupby(a.rename('derived_symbol'), sort=False)
                         .sum()
                         .reset_index(name='person_count')
print (df2)
                       derived_symbol  person_count
0  football.ramesh.TOTAL.mumbai_count            24
1     football.ramesh.TOTAL.ram_count            30
2     football.ramesh.TOTAL.mum_count            39
3   football.ramesh.TOTAL.delhi_count            45
4     football.mohit.TOTAL.pune_count            11
5      football.mohit.TOTAL.moh_count            11
6    cricket.mahesh.TOTAL.surat_count            57
7      cricket.mahesh.TOTAL.mah_count            68
8      cricket.mahesh.TOTAL.sur_count            66
9    cricket.mahesh.TOTAL.panji_count            77