如何跟踪pandas数据帧中的上一个日期记录列?

时间:2017-09-13 11:16:05

标签: python pandas dataframe

此问题引自this SO Question.

我想对pandas Dataframe进行一些数据分析。我有一个如下数据框:

                   derived_symbol  sport_name person_name      city  \
0       football.RAM.mumbai.ram_count    football         RAM    mumbai   
1       football.RAM.mumbai.mum_count    football         RAM    mumbai   
2        football.RAM.delhi.mum_count    football         RAM     delhi   
3        football.RAM.delhi.ram_count    football         RAM     delhi   
4       football.RAM.mumbai.ram_count    football         RAM    mumbai   
5       football.RAM.mumbai.mum_count    football         RAM    mumbai   
6        football.RAM.delhi.mum_count    football         RAM     delhi   
7        football.RAM.delhi.ram_count    football         RAM     delhi   
8       basketball.MAH.pune.mah_count  basketball         MAH      pune   
9     basketball.MAH.nagpur.mah_count  basketball         MAH    nagpur   
10     basketball.MAH.TOTAL.mah_count  basketball         MAH  No Entry   
11  basketball.MAH.TOTAL.nagpur_count  basketball         MAH    nagpur   
12    basketball.MAH.TOTAL.pune_count  basketball         MAH      pune   
13     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
14     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
15       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
16       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
17    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
18    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
19       football.RAM.TOTAL.ram_count    football         RAM  No Entry   
20       football.RAM.TOTAL.ram_count    football         RAM  No Entry   

   person_symbol       month sir  person_count  
0            ram  2017-01-23   a            10  
1            mum  2017-01-23   a            14  
2            mum  2017-01-23   a            25  
3            ram  2017-01-23   a            20  
4            ram  2017-02-22   b            34  
5            mum  2017-02-22   b            23  
6            mum  2017-02-22   b            43  
7            ram  2017-02-22   b            34  
8            mah  2017-03-03   c            10  
9            mah  2017-03-03   c            20  
10           mah  2017-03-03   c            30  
11      No Entry  2017-03-03   c            20  
12      No Entry  2017-03-03   c            10  
13      No Entry  2017-01-23   a            45  
14      No Entry  2017-02-22   b            77  
15           mum  2017-01-23   a            39  
16           mum  2017-02-22   b            66  
17      No Entry  2017-01-23   a            24  
18      No Entry  2017-02-22   b            57  
19           ram  2017-01-23   a            30  
20           ram  2017-02-22   b            68

我想将previous_person_count列添加到此Dataframe。此数据框的“月”列包含格式为“yyyy-mm-dd”的日期。因此,我们需要查看月份,即“mm”字段,以确定它是哪个月。

通过查看本月,我们需要将“person_count”值放入下个月的“previous_person_count”值。

超出输出:

                derived_symbol  sport_name person_name      city  \
0       football.RAM.mumbai.ram_count    football         RAM    mumbai   
1       football.RAM.mumbai.mum_count    football         RAM    mumbai   
2        football.RAM.delhi.mum_count    football         RAM     delhi   
3        football.RAM.delhi.ram_count    football         RAM     delhi   
4       football.RAM.mumbai.ram_count    football         RAM    mumbai   
5       football.RAM.mumbai.mum_count    football         RAM    mumbai   
6        football.RAM.delhi.mum_count    football         RAM     delhi   
7        football.RAM.delhi.ram_count    football         RAM     delhi   
8       basketball.MAH.pune.mah_count  basketball         MAH      pune   
9     basketball.MAH.nagpur.mah_count  basketball         MAH    nagpur   
10     basketball.MAH.TOTAL.mah_count  basketball         MAH  No Entry   
11  basketball.MAH.TOTAL.nagpur_count  basketball         MAH    nagpur   
12    basketball.MAH.TOTAL.pune_count  basketball         MAH      pune   
13     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
14     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
15       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
16       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
17    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
18    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
19       football.RAM.TOTAL.ram_count    football         RAM  No Entry   
20       football.RAM.TOTAL.ram_count    football         RAM  No Entry   

   person_symbol       month sir  person_count      previous_person_count
0            ram  2017-01-23   a            10      0
1            mum  2017-01-23   a            14      0
2            mum  2017-01-23   a            25      0
3            ram  2017-01-23   a            20      0
4            ram  2017-02-22   b            34      10
5            mum  2017-02-22   b            23      14
6            mum  2017-02-22   b            43      25
7            ram  2017-02-22   b            34      20
8            mah  2017-03-03   c            10      0
9            mah  2017-03-03   c            20      0
10           mah  2017-03-03   c            30      0
11      No Entry  2017-03-03   c            20      0
12      No Entry  2017-03-03   c            10      0
13      No Entry  2017-01-23   a            45      0
14      No Entry  2017-02-22   b            77      45
15           mum  2017-01-23   a            39      0
16           mum  2017-02-22   b            66      39
17      No Entry  2017-01-23   a            24      0
18      No Entry  2017-02-22   b            57      24
19           ram  2017-01-23   a            30      0
20           ram  2017-02-22   b            68      30

编辑参考代码:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-22','2017-02-22','2017-02-22','2017-02-22','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})

df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

df['person_name'] = df['person_name'].apply(symbology)

df['person_count'] = df['person_count'].astype(int)

print df
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name','level_5','val'])['person_count'].sum().reset_index(name='person_count')

df3 = df2.set_index(['derived_symbol','month','sir','sport_name','person_name','person_count','level_5'])['val'].unstack().fillna('No Entry').rename_axis(None, 1).reset_index()

df['derived_symbol'] = df['sport_name'] + '.' + df['person_name'] + '.' + df['city'] + "."+ df['person_symbol'] + '_count'
df4 = pd.concat([df, df3]).reset_index(None)
print df3
del df4['index']
df4 = df4[['derived_symbol','sport_name','person_name','city','person_symbol','month','sir','person_count']]
print df4

便利性:

d = {'city': {0: 'mumbai',
  1: 'mumbai',
  2: 'delhi',
  3: 'delhi',
  4: 'mumbai',
  5: 'mumbai',
  6: 'delhi',
  7: 'delhi',
  8: 'pune',
  9: 'nagpur',
  10: 'No Entry',
  11: 'nagpur',
  12: 'pune',
  13: 'delhi',
  14: 'delhi',
  15: 'No Entry',
  16: 'No Entry',
  17: 'mumbai',
  18: 'mumbai',
  19: 'No Entry',
  20: 'No Entry'},
 'derived_symbol': {0: 'football.RAM.mumbai.ram_count',
  1: 'football.RAM.mumbai.mum_count',
  2: 'football.RAM.delhi.mum_count',
  3: 'football.RAM.delhi.ram_count',
  4: 'football.RAM.mumbai.ram_count',
  5: 'football.RAM.mumbai.mum_count',
  6: 'football.RAM.delhi.mum_count',
  7: 'football.RAM.delhi.ram_count',
  8: 'basketball.MAH.pune.mah_count',
  9: 'basketball.MAH.nagpur.mah_count',
  10: 'basketball.MAH.TOTAL.mah_count',
  11: 'basketball.MAH.TOTAL.nagpur_count',
  12: 'basketball.MAH.TOTAL.pune_count',
  13: 'football.RAM.TOTAL.delhi_count',
  14: 'football.RAM.TOTAL.delhi_count',
  15: 'football.RAM.TOTAL.mum_count',
  16: 'football.RAM.TOTAL.mum_count',
  17: 'football.RAM.TOTAL.mumbai_count',
  18: 'football.RAM.TOTAL.mumbai_count',
  19: 'football.RAM.TOTAL.ram_count',
  20: 'football.RAM.TOTAL.ram_count'},
 'month': {0: '2017-01-23',
  1: '2017-01-23',
  2: '2017-01-23',
  3: '2017-01-23',
  4: '2017-02-22',
  5: '2017-02-22',
  6: '2017-02-22',
  7: '2017-02-22',
  8: '2017-03-03',
  9: '2017-03-03',
  10: '2017-03-03',
  11: '2017-03-03',
  12: '2017-03-03',
  13: '2017-01-23',
  14: '2017-02-22',
  15: '2017-01-23',
  16: '2017-02-22',
  17: '2017-01-23',
  18: '2017-02-22',
  19: '2017-01-23',
  20: '2017-02-22'},
 'person_count': {0: 10,
  1: 14,
  2: 25,
  3: 20,
  4: 34,
  5: 23,
  6: 43,
  7: 34,
  8: 10,
  9: 20,
  10: 30,
  11: 20,
  12: 10,
  13: 45,
  14: 77,
  15: 39,
  16: 66,
  17: 24,
  18: 57,
  19: 30,
  20: 68},
 'person_name': {0: 'RAM',
  1: 'RAM',
  2: 'RAM',
  3: 'RAM',
  4: 'RAM',
  5: 'RAM',
  6: 'RAM',
  7: 'RAM',
  8: 'MAH',
  9: 'MAH',
  10: 'MAH',
  11: 'MAH',
  12: 'MAH',
  13: 'RAM',
  14: 'RAM',
  15: 'RAM',
  16: 'RAM',
  17: 'RAM',
  18: 'RAM',
  19: 'RAM',
  20: 'RAM'},
 'person_symbol': {0: 'ram',
  1: 'mum',
  2: 'mum',
  3: 'ram',
  4: 'ram',
  5: 'mum',
  6: 'mum',
  7: 'ram',
  8: 'mah',
  9: 'mah',
  10: 'mah',
  11: 'No Entry',
  12: 'No Entry',
  13: 'No Entry',
  14: 'No Entry',
  15: 'mum',
  16: 'mum',
  17: 'No Entry',
  18: 'No Entry',
  19: 'ram',
  20: 'ram'},
 'sir': {0: 'a',
  1: 'a',
  2: 'a',
  3: 'a',
  4: 'b',
  5: 'b',
  6: 'b',
  7: 'b',
  8: 'c',
  9: 'c',
  10: 'c',
  11: 'c',
  12: 'c',
  13: 'a',
  14: 'b',
  15: 'a',
  16: 'b',
  17: 'a',
  18: 'b',
  19: 'a',
  20: 'b'},
 'sport_name': {0: 'football',
  1: 'football',
  2: 'football',
  3: 'football',
  4: 'football',
  5: 'football',
  6: 'football',
  7: 'football',
  8: 'basketball',
  9: 'basketball',
  10: 'basketball',
  11: 'basketball',
  12: 'basketball',
  13: 'football',
  14: 'football',
  15: 'football',
  16: 'football',
  17: 'football',
  18: 'football',
  19: 'football',
  20: 'football'}}

1 个答案:

答案 0 :(得分:1)

在计算月份数(从日期开始)和前一个数字框架之后,您可以做的是merge数据框本身。

让我们从计算这两个值开始。为方便起见,我首先将原始month字符串值转换为datetime,这允许我使用relativedelta来计算上个月。即使在一年的变化之后,这也可以确保行为正确。

In [7]: df['month'] = pd.to_datetime(df['month'])

In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m'))

In [9]: from dateutil.relativedelta import relativedelta

In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))

In [11]: df
Out[11]:
     city      month person_count person_name person_symbol sir  sport_name  \
0  mumbai 2017-01-23           10      ramesh           ram   a    football
1  mumbai 2017-01-23           14      ramesh           mum   a    football
2   delhi 2017-01-23           25      ramesh           mum   a    football
3   delhi 2017-01-23           20      ramesh           ram   a    football
4  mumbai 2017-02-22           34      ramesh           ram   b    football
5  mumbai 2017-02-22           23      ramesh           mum   b    football
6   delhi 2017-02-22           43      ramesh           mum   b    football
7   delhi 2017-02-22           34      ramesh           ram   b    football
8    pune 2017-03-03           10      mahesh           mah   c  basketball
9  nagpur 2017-03-03           20      mahesh           mah   c  basketball

  month_num previous_month_num
0   2017-01            2016-12
1   2017-01            2016-12
2   2017-01            2016-12
3   2017-01            2016-12
4   2017-02            2017-01
5   2017-02            2017-01
6   2017-02            2017-01
7   2017-02            2017-01
8   2017-03            2017-02
9   2017-03            2017-02

然后我们可以将数据框合并到自身中,使用计算的月份值作为合并键:

In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name']

In [13]: pd.merge(df, df, left_on=relevant_columns + ['previous_month_num'], right_on=rele
    ...: vant_columns + ['month_num'], how='left', suffixes=('', '_previous'))[list(df.col
    ...: umns) + ['person_count_previous']].fillna(0).drop(['month_num', 'previous_month_n
    ...: um'], axis=1)
Out[13]:
     city      month person_count person_name person_symbol sir  sport_name  \
0  mumbai 2017-01-23           10      ramesh           ram   a    football
1  mumbai 2017-01-23           14      ramesh           mum   a    football
2   delhi 2017-01-23           25      ramesh           mum   a    football
3   delhi 2017-01-23           20      ramesh           ram   a    football
4  mumbai 2017-02-22           34      ramesh           ram   b    football
5  mumbai 2017-02-22           23      ramesh           mum   b    football
6   delhi 2017-02-22           43      ramesh           mum   b    football
7   delhi 2017-02-22           34      ramesh           ram   b    football
8    pune 2017-03-03           10      mahesh           mah   c  basketball
9  nagpur 2017-03-03           20      mahesh           mah   c  basketball

  person_count_previous
0                     0
1                     0
2                     0
3                     0
4                    10
5                    14
6                    25
7                    20
8                     0
9                     0

一些意见:

  • 我使用['city', 'person_symbol', 'sport_name']作为参考列,但可以随意添加更多内容,具体取决于您希望实现的目标。
  • 新列名为person_count_previous,但如果最适合您,可以rename
  • 默认情况下,如果前一个计数不匹配,则该列将为NaN。由于fillna
  • ,我将值替换为0
  • 我删除了"临时"使用drop的列,但随时保留它们。