I have a pandas DataFrame as shown below:
import pandas as pd
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','cricket','cricket','cricket','cricket'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','mohit','mahesh','mahesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','pune','surat','surat','panji','panji'],
'person_symbol': ['ram','mum','mum','ram','moh','mah','sur','sur','mah'],
'person_count': ['10','14','25','20','11','34','23','43','34']})
df = df[['sport_name','person_name','city','person_symbol','person_count']]
print(df)
  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20
4   football       mohit    pune           moh           11
5    cricket      mahesh   surat           mah           34
6    cricket      mahesh   surat           sur           23
7    cricket      mahesh   panji           sur           43
8    cricket      mahesh   panji           mah           34
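From this DataFrame, I want to create a new DataFrame with two columns, named "derived_symbol" and "person_count". To build it, the following conditions apply:
Example:
Consider the first group of the DataFrame, i.e. sport_name = football and person_name = ramesh, which has four rows: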
  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20
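In these four rows, the city and person_symbol columns contain four distinct values: mumbai, delhi, ram and mum. Each total is the sum of person_count over the rows in which that value appears (for example, mumbai = 10 + 14 = 24). So for these four rows the output should look like this: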
derived_symbol person_count
football.ramesh.TOTAL.mumbai_count 24
football.ramesh.TOTAL.delhi_count 45
football.ramesh.TOTAL.ram_count 30
football.ramesh.TOTAL.mum_count 39
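To make the rule concrete, here is a minimal sketch (an illustration only, not code from the question) that computes the four totals for the football/ramesh group by hand. The names `group` and `totals` are made up for this example, and the counts are written as integers for simplicity.

```python
import pandas as pd

# The four football/ramesh rows from the question.
group = pd.DataFrame({
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi'],
    'person_symbol': ['ram', 'mum', 'mum', 'ram'],
    'person_count': [10, 14, 25, 20],
})

# Add each row's count once to its city and once to its person_symbol.
totals = {}
for col in ['city', 'person_symbol']:
    for value, count in zip(group[col], group['person_count']):
        totals[value] = totals.get(value, 0) + int(count)

print(totals)  # {'mumbai': 24, 'delhi': 45, 'ram': 30, 'mum': 39}
```

Every row contributes twice, once through its city and once through its person_symbol, which is why the totals add up to twice the sum of person_count.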
Final expected result:
derived_symbol person_count
football.ramesh.TOTAL.mumbai_count 24
football.ramesh.TOTAL.delhi_count 45
football.ramesh.TOTAL.ram_count 30
football.ramesh.TOTAL.mum_count 39
football.mohit.TOTAL.pune_count 11
football.mohit.TOTAL.moh_count 11
cricket.mahesh.TOTAL.surat_count 57
cricket.mahesh.TOTAL.panji_count 77
cricket.mahesh.TOTAL.sur_count 66
cricket.mahesh.TOTAL.mah_count 68
Edit: the DataFrame now has a date column, as follows:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
'person_count': ['10','14','25','20','34','23','43','34'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]
print(df)
  sport_name person_name    city person_symbol person_count       month
0   football      ramesh  mumbai           ram           10  2017-01-23
1   football      ramesh  mumbai           mum           14  2017-01-23
2   football      ramesh   delhi           mum           25  2017-01-23
3   football      ramesh   delhi           ram           20  2017-01-23
4   football      ramesh  mumbai           ram           34  2017-02-26
5   football      ramesh  mumbai           mum           23  2017-02-26
6   football      ramesh   delhi           mum           43  2017-02-26
7   football      ramesh   delhi           ram           34  2017-02-26
Expected output:
derived_symbol person_count month
football.ramesh.TOTAL.mumbai_count 24 2017-01-23
football.ramesh.TOTAL.delhi_count 45 2017-01-23
football.ramesh.TOTAL.ram_count 30 2017-01-23
football.ramesh.TOTAL.mum_count 39 2017-01-23
football.ramesh.TOTAL.mumbai_count 57 2017-02-26
football.ramesh.TOTAL.delhi_count 77 2017-02-26
football.ramesh.TOTAL.ram_count 68 2017-02-26
football.ramesh.TOTAL.mum_count 66 2017-02-26
I computed person_count per month as follows:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
'person_count': ['10','14','25','20','34','23','43','34'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]
df['person_count'] = df['person_count'].astype(int)
df1=df.set_index(['sport_name','person_name','person_count','month']).stack().reset_index(name='val')
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
df2 = df1.groupby(['derived_symbol','month'])['person_count'].sum().reset_index(name='person_count')
print (df2)
Answer 0 (score: 2)
Here is one way.
First, change the type of person_count to numeric.
In [2126]: df.person_count = df.person_count.astype(int)
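The conversion matters because the question stores the counts as strings. A small sketch of the difference, using a throwaway Series named `counts` (a name chosen here, not from the original post):

```python
import pandas as pd

counts = pd.Series(['10', '14'])   # strings, as in the question's data

# Plain string addition concatenates instead of adding numerically.
print('10' + '14')                 # 1014

# After conversion, the sum is arithmetic.
numeric = counts.astype(int)
print(numeric.sum())               # 24
```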
Reshape your data so that city and person_symbol end up in a single column, then use groupby to get the total counts.
In [2127]: dff = (df.melt(id_vars=['sport_name', 'person_name', 'person_count'])
.groupby(['sport_name', 'person_name', 'value']).person_count.sum())
In [2128]: dff
Out[2128]:
sport_name  person_name  value
cricket     mahesh       mah       68
                         panji     77
                         sur       66
                         surat     57
football    mohit        moh       11
                         pune      11
            ramesh       delhi     45
                         mum       39
                         mumbai    24
                         ram       30
Name: person_count, dtype: int32
Use format to build a custom-formatted label from the index levels.
In [2129]: dff.index = ['{0}.{1}.TOTAL.{2}_count'.format(*idx) for idx in dff.index]
In [2130]: dff
Out[2130]:
cricket.mahesh.TOTAL.mah_count 68
cricket.mahesh.TOTAL.panji_count 77
cricket.mahesh.TOTAL.sur_count 66
cricket.mahesh.TOTAL.surat_count 57
football.mohit.TOTAL.moh_count 11
football.mohit.TOTAL.pune_count 11
football.ramesh.TOTAL.delhi_count 45
football.ramesh.TOTAL.mum_count 39
football.ramesh.TOTAL.mumbai_count 24
football.ramesh.TOTAL.ram_count 30
Name: person_count, dtype: int32
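The same melt approach can plausibly be extended to the edited data that includes a month column. The following is a minimal sketch under the assumption that month should simply act as one more grouping key; `long_df` and `out` are names chosen here for illustration, and the counts are entered as integers so no conversion step is needed.

```python
import pandas as pd

# The edited data from the question, with counts already numeric.
df = pd.DataFrame({
    'sport_name': ['football'] * 8,
    'person_name': ['ramesh'] * 8,
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi'] * 2,
    'person_symbol': ['ram', 'mum', 'mum', 'ram'] * 2,
    'person_count': [10, 14, 25, 20, 34, 23, 43, 34],
    'month': ['2017-01-23'] * 4 + ['2017-02-26'] * 4,
})

# Stack city and person_symbol into one 'value' column; month stays an id column.
long_df = df.melt(
    id_vars=['sport_name', 'person_name', 'person_count', 'month'],
    value_vars=['city', 'person_symbol'],
)

# Sum the counts per (month, sport, person, value).
out = (long_df
       .groupby(['month', 'sport_name', 'person_name', 'value'], sort=False)['person_count']
       .sum()
       .reset_index())

# Build the label column and keep only the requested columns.
out['derived_symbol'] = (out['sport_name'] + '.' + out['person_name']
                         + '.TOTAL.' + out['value'] + '_count')
out = out[['derived_symbol', 'person_count', 'month']]
print(out)
```

The row order of `out` follows the order in which groups first appear after melting, so it may differ from the order shown in the question's expected output; sort explicitly if a particular order is required.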
Answer 1 (score: 2)
Use:
#convert column to int
df['person_count'] = df['person_count'].astype(int)
#reshape for one column from city and person_symbol columns
df1=df.set_index(['sport_name','person_name','person_count']).stack().reset_index(name='val')
print (df1)
   sport_name person_name  person_count        level_3     val
0    football      ramesh            10           city  mumbai
1    football      ramesh            10  person_symbol     ram
2    football      ramesh            14           city  mumbai
3    football      ramesh            14  person_symbol     mum
4    football      ramesh            25           city   delhi
5    football      ramesh            25  person_symbol     mum
6    football      ramesh            20           city   delhi
7    football      ramesh            20  person_symbol     ram
8    football       mohit            11           city    pune
9    football       mohit            11  person_symbol     moh
10    cricket      mahesh            34           city   surat
11    cricket      mahesh            34  person_symbol     mah
12    cricket      mahesh            23           city   surat
13    cricket      mahesh            23  person_symbol     sur
14    cricket      mahesh            43           city   panji
15    cricket      mahesh            43  person_symbol     sur
16    cricket      mahesh            34           city   panji
17    cricket      mahesh            34  person_symbol     mah
#concatenate columns
a = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
#groupby by Series a and aggregate sum
df2 = (df1['person_count'].groupby(a.rename('derived_symbol'), sort=False)
                          .sum()
                          .reset_index(name='person_count'))
print (df2)
                       derived_symbol  person_count
0  football.ramesh.TOTAL.mumbai_count            24
1     football.ramesh.TOTAL.ram_count            30
2     football.ramesh.TOTAL.mum_count            39
3   football.ramesh.TOTAL.delhi_count            45
4     football.mohit.TOTAL.pune_count            11
5      football.mohit.TOTAL.moh_count            11
6    cricket.mahesh.TOTAL.surat_count            57
7      cricket.mahesh.TOTAL.mah_count            68
8      cricket.mahesh.TOTAL.sur_count            66
9    cricket.mahesh.TOTAL.panji_count            77