此问题引自this SO Question.
我想对pandas Dataframe进行一些数据分析。我有一个如下数据框:
derived_symbol sport_name person_name city \
0 football.RAM.mumbai.ram_count football RAM mumbai
1 football.RAM.mumbai.mum_count football RAM mumbai
2 football.RAM.delhi.mum_count football RAM delhi
3 football.RAM.delhi.ram_count football RAM delhi
4 football.RAM.mumbai.ram_count football RAM mumbai
5 football.RAM.mumbai.mum_count football RAM mumbai
6 football.RAM.delhi.mum_count football RAM delhi
7 football.RAM.delhi.ram_count football RAM delhi
8 basketball.MAH.pune.mah_count basketball MAH pune
9 basketball.MAH.nagpur.mah_count basketball MAH nagpur
10 basketball.MAH.TOTAL.mah_count basketball MAH No Entry
11 basketball.MAH.TOTAL.nagpur_count basketball MAH nagpur
12 basketball.MAH.TOTAL.pune_count basketball MAH pune
13 football.RAM.TOTAL.delhi_count football RAM delhi
14 football.RAM.TOTAL.delhi_count football RAM delhi
15 football.RAM.TOTAL.mum_count football RAM No Entry
16 football.RAM.TOTAL.mum_count football RAM No Entry
17 football.RAM.TOTAL.mumbai_count football RAM mumbai
18 football.RAM.TOTAL.mumbai_count football RAM mumbai
19 football.RAM.TOTAL.ram_count football RAM No Entry
20 football.RAM.TOTAL.ram_count football RAM No Entry
person_symbol month sir person_count
0 ram 2017-01-23 a 10
1 mum 2017-01-23 a 14
2 mum 2017-01-23 a 25
3 ram 2017-01-23 a 20
4 ram 2017-02-22 b 34
5 mum 2017-02-22 b 23
6 mum 2017-02-22 b 43
7 ram 2017-02-22 b 34
8 mah 2017-03-03 c 10
9 mah 2017-03-03 c 20
10 mah 2017-03-03 c 30
11 No Entry 2017-03-03 c 20
12 No Entry 2017-03-03 c 10
13 No Entry 2017-01-23 a 45
14 No Entry 2017-02-22 b 77
15 mum 2017-01-23 a 39
16 mum 2017-02-22 b 66
17 No Entry 2017-01-23 a 24
18 No Entry 2017-02-22 b 57
19 ram 2017-01-23 a 30
20 ram 2017-02-22 b 68
我想将previous_person_count列添加到此Dataframe。此数据框的“月”列包含格式为“yyyy-mm-dd”的日期。因此,我们需要查看月份,即“mm”字段,以确定它是哪个月。
通过查看本月,我们需要将“person_count”值放入下个月的“previous_person_count”值。
超出输出:
derived_symbol sport_name person_name city \
0 football.RAM.mumbai.ram_count football RAM mumbai
1 football.RAM.mumbai.mum_count football RAM mumbai
2 football.RAM.delhi.mum_count football RAM delhi
3 football.RAM.delhi.ram_count football RAM delhi
4 football.RAM.mumbai.ram_count football RAM mumbai
5 football.RAM.mumbai.mum_count football RAM mumbai
6 football.RAM.delhi.mum_count football RAM delhi
7 football.RAM.delhi.ram_count football RAM delhi
8 basketball.MAH.pune.mah_count basketball MAH pune
9 basketball.MAH.nagpur.mah_count basketball MAH nagpur
10 basketball.MAH.TOTAL.mah_count basketball MAH No Entry
11 basketball.MAH.TOTAL.nagpur_count basketball MAH nagpur
12 basketball.MAH.TOTAL.pune_count basketball MAH pune
13 football.RAM.TOTAL.delhi_count football RAM delhi
14 football.RAM.TOTAL.delhi_count football RAM delhi
15 football.RAM.TOTAL.mum_count football RAM No Entry
16 football.RAM.TOTAL.mum_count football RAM No Entry
17 football.RAM.TOTAL.mumbai_count football RAM mumbai
18 football.RAM.TOTAL.mumbai_count football RAM mumbai
19 football.RAM.TOTAL.ram_count football RAM No Entry
20 football.RAM.TOTAL.ram_count football RAM No Entry
person_symbol month sir person_count previous_person_count
0 ram 2017-01-23 a 10 0
1 mum 2017-01-23 a 14 0
2 mum 2017-01-23 a 25 0
3 ram 2017-01-23 a 20 0
4 ram 2017-02-22 b 34 10
5 mum 2017-02-22 b 23 14
6 mum 2017-02-22 b 43 25
7 ram 2017-02-22 b 34 20
8 mah 2017-03-03 c 10 0
9 mah 2017-03-03 c 20 0
10 mah 2017-03-03 c 30 0
11 No Entry 2017-03-03 c 20 0
12 No Entry 2017-03-03 c 10 0
13 No Entry 2017-01-23 a 45 0
14 No Entry 2017-02-22 b 77 45
15 mum 2017-01-23 a 39 0
16 mum 2017-02-22 b 66 39
17 No Entry 2017-01-23 a 24 0
18 No Entry 2017-02-22 b 57 24
19 ram 2017-01-23 a 30 0
20 ram 2017-02-22 b 68 30
编辑参考代码:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-22','2017-02-22','2017-02-22','2017-02-22','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
df['person_name'] = df['person_name'].apply(symbology)
df['person_count'] = df['person_count'].astype(int)
print df
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name','level_5','val'])['person_count'].sum().reset_index(name='person_count')
df3 = df2.set_index(['derived_symbol','month','sir','sport_name','person_name','person_count','level_5'])['val'].unstack().fillna('No Entry').rename_axis(None, 1).reset_index()
df['derived_symbol'] = df['sport_name'] + '.' + df['person_name'] + '.' + df['city'] + "."+ df['person_symbol'] + '_count'
df4 = pd.concat([df, df3]).reset_index(None)
print df3
del df4['index']
df4 = df4[['derived_symbol','sport_name','person_name','city','person_symbol','month','sir','person_count']]
print df4
便利性:
d = {'city': {0: 'mumbai',
1: 'mumbai',
2: 'delhi',
3: 'delhi',
4: 'mumbai',
5: 'mumbai',
6: 'delhi',
7: 'delhi',
8: 'pune',
9: 'nagpur',
10: 'No Entry',
11: 'nagpur',
12: 'pune',
13: 'delhi',
14: 'delhi',
15: 'No Entry',
16: 'No Entry',
17: 'mumbai',
18: 'mumbai',
19: 'No Entry',
20: 'No Entry'},
'derived_symbol': {0: 'football.RAM.mumbai.ram_count',
1: 'football.RAM.mumbai.mum_count',
2: 'football.RAM.delhi.mum_count',
3: 'football.RAM.delhi.ram_count',
4: 'football.RAM.mumbai.ram_count',
5: 'football.RAM.mumbai.mum_count',
6: 'football.RAM.delhi.mum_count',
7: 'football.RAM.delhi.ram_count',
8: 'basketball.MAH.pune.mah_count',
9: 'basketball.MAH.nagpur.mah_count',
10: 'basketball.MAH.TOTAL.mah_count',
11: 'basketball.MAH.TOTAL.nagpur_count',
12: 'basketball.MAH.TOTAL.pune_count',
13: 'football.RAM.TOTAL.delhi_count',
14: 'football.RAM.TOTAL.delhi_count',
15: 'football.RAM.TOTAL.mum_count',
16: 'football.RAM.TOTAL.mum_count',
17: 'football.RAM.TOTAL.mumbai_count',
18: 'football.RAM.TOTAL.mumbai_count',
19: 'football.RAM.TOTAL.ram_count',
20: 'football.RAM.TOTAL.ram_count'},
'month': {0: '2017-01-23',
1: '2017-01-23',
2: '2017-01-23',
3: '2017-01-23',
4: '2017-02-22',
5: '2017-02-22',
6: '2017-02-22',
7: '2017-02-22',
8: '2017-03-03',
9: '2017-03-03',
10: '2017-03-03',
11: '2017-03-03',
12: '2017-03-03',
13: '2017-01-23',
14: '2017-02-22',
15: '2017-01-23',
16: '2017-02-22',
17: '2017-01-23',
18: '2017-02-22',
19: '2017-01-23',
20: '2017-02-22'},
'person_count': {0: 10,
1: 14,
2: 25,
3: 20,
4: 34,
5: 23,
6: 43,
7: 34,
8: 10,
9: 20,
10: 30,
11: 20,
12: 10,
13: 45,
14: 77,
15: 39,
16: 66,
17: 24,
18: 57,
19: 30,
20: 68},
'person_name': {0: 'RAM',
1: 'RAM',
2: 'RAM',
3: 'RAM',
4: 'RAM',
5: 'RAM',
6: 'RAM',
7: 'RAM',
8: 'MAH',
9: 'MAH',
10: 'MAH',
11: 'MAH',
12: 'MAH',
13: 'RAM',
14: 'RAM',
15: 'RAM',
16: 'RAM',
17: 'RAM',
18: 'RAM',
19: 'RAM',
20: 'RAM'},
'person_symbol': {0: 'ram',
1: 'mum',
2: 'mum',
3: 'ram',
4: 'ram',
5: 'mum',
6: 'mum',
7: 'ram',
8: 'mah',
9: 'mah',
10: 'mah',
11: 'No Entry',
12: 'No Entry',
13: 'No Entry',
14: 'No Entry',
15: 'mum',
16: 'mum',
17: 'No Entry',
18: 'No Entry',
19: 'ram',
20: 'ram'},
'sir': {0: 'a',
1: 'a',
2: 'a',
3: 'a',
4: 'b',
5: 'b',
6: 'b',
7: 'b',
8: 'c',
9: 'c',
10: 'c',
11: 'c',
12: 'c',
13: 'a',
14: 'b',
15: 'a',
16: 'b',
17: 'a',
18: 'b',
19: 'a',
20: 'b'},
'sport_name': {0: 'football',
1: 'football',
2: 'football',
3: 'football',
4: 'football',
5: 'football',
6: 'football',
7: 'football',
8: 'basketball',
9: 'basketball',
10: 'basketball',
11: 'basketball',
12: 'basketball',
13: 'football',
14: 'football',
15: 'football',
16: 'football',
17: 'football',
18: 'football',
19: 'football',
20: 'football'}}
答案 0 :(得分:1)
在计算月份数(从日期开始)和前一个数字框架之后,您可以做的是merge
数据框本身。
让我们从计算这两个值开始。为方便起见,我首先将原始month
字符串值转换为datetime,这允许我使用relativedelta
来计算上个月。即使在一年的变化之后,这也可以确保行为正确。
In [7]: df['month'] = pd.to_datetime(df['month'])
In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m'))
In [9]: from dateutil.relativedelta import relativedelta
In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))
In [11]: df
Out[11]:
city month person_count person_name person_symbol sir sport_name \
0 mumbai 2017-01-23 10 ramesh ram a football
1 mumbai 2017-01-23 14 ramesh mum a football
2 delhi 2017-01-23 25 ramesh mum a football
3 delhi 2017-01-23 20 ramesh ram a football
4 mumbai 2017-02-22 34 ramesh ram b football
5 mumbai 2017-02-22 23 ramesh mum b football
6 delhi 2017-02-22 43 ramesh mum b football
7 delhi 2017-02-22 34 ramesh ram b football
8 pune 2017-03-03 10 mahesh mah c basketball
9 nagpur 2017-03-03 20 mahesh mah c basketball
month_num previous_month_num
0 2017-01 2016-12
1 2017-01 2016-12
2 2017-01 2016-12
3 2017-01 2016-12
4 2017-02 2017-01
5 2017-02 2017-01
6 2017-02 2017-01
7 2017-02 2017-01
8 2017-03 2017-02
9 2017-03 2017-02
然后我们可以将数据框合并到自身中,使用计算的月份值作为合并键:
In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name']
In [13]: pd.merge(df, df, left_on=relevant_columns + ['previous_month_num'], right_on=rele
...: vant_columns + ['month_num'], how='left', suffixes=('', '_previous'))[list(df.col
...: umns) + ['person_count_previous']].fillna(0).drop(['month_num', 'previous_month_n
...: um'], axis=1)
Out[13]:
city month person_count person_name person_symbol sir sport_name \
0 mumbai 2017-01-23 10 ramesh ram a football
1 mumbai 2017-01-23 14 ramesh mum a football
2 delhi 2017-01-23 25 ramesh mum a football
3 delhi 2017-01-23 20 ramesh ram a football
4 mumbai 2017-02-22 34 ramesh ram b football
5 mumbai 2017-02-22 23 ramesh mum b football
6 delhi 2017-02-22 43 ramesh mum b football
7 delhi 2017-02-22 34 ramesh ram b football
8 pune 2017-03-03 10 mahesh mah c basketball
9 nagpur 2017-03-03 20 mahesh mah c basketball
person_count_previous
0 0
1 0
2 0
3 0
4 10
5 14
6 25
7 20
8 0
9 0
一些意见: