I have a DataFrame like the following:
name,date
AAA,201705
AAA,201706
AAA,201707
AAA,201708
AAA,201710
AAA,201711
AAA,201802
AAA,201803
AAA,201804
AAA,201805
AAA,201806
AAA,201807
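The DataFrame has two columns, name and date. The date column holds only the year and month, in yyyymm format.
The months 201709, 201712, and 201801 are missing from the date column.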
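For reference, the missing months can be listed by comparing the observed months with a complete monthly range. This is only a minimal sketch: it assumes the date column holds integers as shown above, and the variable names `present`, `expected`, and `missing` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA"] * 12,
    "date": [201705, 201706, 201707, 201708, 201710, 201711,
             201802, 201803, 201804, 201805, 201806, 201807],
})

# Parse the yyyymm values and convert them to monthly periods.
months = pd.to_datetime(df["date"].astype(str), format="%Y%m").dt.to_period("M")
present = pd.PeriodIndex(months)

# Every month between the first and the last observed month.
expected = pd.period_range(present.min(), present.max(), freq="M")

# Months in the full range that never occur in the data.
missing = expected.difference(present)
print(missing.strftime("%Y%m").tolist())
```

With several names in the data, the same comparison would have to be done per name, for example inside a `groupby('name')`.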
I need to check whether every month is present. If any months are missing, I need output in the following format:
name,start_date,end_date,count
AAA,201709,201709,1
AAA,201712,201801,2
I am trying to use the pandas diff function.
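One way the diff idea might be carried through is sketched below. It assumes the date column holds yyyymm integers and converts them to a running month number so that consecutive months differ by exactly 1. The helper `to_yyyymm` and the intermediate names are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["AAA"] * 12,
    "date": [201705, 201706, 201707, 201708, 201710, 201711,
             201802, 201803, 201804, 201805, 201806, 201807],
})
df = df.sort_values(["name", "date"])

# Running month number: year * 12 + month (month stays 1-based).
month_no = (df["date"] // 100) * 12 + df["date"] % 100

# Difference in months from the previous row of the same name.
gap = month_no.groupby(df["name"]).diff()

# Rows that come right after a run of missing months.
mask = gap > 1
count = (gap[mask] - 1).astype(int)   # number of missing months in the run
end_no = month_no[mask] - 1           # last missing month
start_no = end_no - count + 1         # first missing month


def to_yyyymm(n):
    """Hypothetical helper: invert year * 12 + month back to yyyymm."""
    return ((n - 1) // 12) * 100 + (n - 1) % 12 + 1


result = pd.DataFrame({
    "name": df.loc[mask, "name"],
    "start_date": to_yyyymm(start_no),
    "end_date": to_yyyymm(end_no),
    "count": count,
}).reset_index(drop=True)
print(result)
```

Working on month numbers rather than raw yyyymm values matters here: a plain `diff` on 201711 and 201802 would give 91, which says nothing about how many months are missing.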
Answer 0 (score: 3)
Use asfreq:
import pandas as pd

# convert the date column to datetimes
df['date'] = pd.to_datetime(df['date'], format='%Y%m')
# reindex each name to month-start frequency; missing months become NaN
a = df.set_index('date').groupby('name')['name'].apply(lambda x: x.asfreq('MS'))
# keep only the NaN rows; consecutive NaNs share the same group key g
b = a.notnull().cumsum()[a.isnull()].reset_index(name='g')
# per group: first and last missing date, the name, and the number of rows
d = {'date':['first','last'],'name':['first', 'size']}
df = b.groupby('g').agg(d).reset_index(drop=True)
# flatten the MultiIndex columns and rename them
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'date_first':'start_date',
                        'date_last':'end_date',
                        'name_first':'name',
                        'name_size':'count'})
print (df)
  start_date   end_date name  count
0 2017-09-01 2017-09-01  AAA      1
1 2017-12-01 2018-01-01  AAA      2
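The result above holds full timestamps, while the question asks for yyyymm values with name as the first column. A minimal sketch of that last formatting step follows; the frame `res` is a hand-built stand-in for the output shown above, not part of the original answer.

```python
import pandas as pd

# Stand-in for the result frame printed above.
res = pd.DataFrame({
    "start_date": pd.to_datetime(["2017-09-01", "2017-12-01"]),
    "end_date": pd.to_datetime(["2017-09-01", "2018-01-01"]),
    "name": ["AAA", "AAA"],
    "count": [1, 2],
})

# Format both date columns back to yyyymm strings.
for col in ["start_date", "end_date"]:
    res[col] = res[col].dt.strftime("%Y%m")

# Reorder the columns to match the layout requested in the question.
res = res[["name", "start_date", "end_date", "count"]]
print(res.to_csv(index=False))
```

Note that `strftime` yields strings; if integers are needed, an `astype(int)` on the two columns would be one option.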
Details:
print (a)
name  date
AAA   2017-05-01    AAA
      2017-06-01    AAA
      2017-07-01    AAA
      2017-08-01    AAA
      2017-09-01    NaN
      2017-10-01    AAA
      2017-11-01    AAA
      2017-12-01    NaN
      2018-01-01    NaN
      2018-02-01    AAA
      2018-03-01    AAA
      2018-04-01    AAA
      2018-05-01    AAA
      2018-06-01    AAA
      2018-07-01    AAA
Name: name, dtype: object
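The grouping step `a.notnull().cumsum()[a.isnull()]` may be easier to follow on a tiny example. The sketch below uses a made-up Series `s` with the same pattern of gaps (one single NaN, then a run of two); the running count of non-null values stays constant across each run of NaNs, so it can serve as a group key.

```python
import pandas as pd

# Toy series: one isolated gap, then a gap of two consecutive rows.
s = pd.Series(["AAA", None, "AAA", None, None, "AAA"])

# Running count of non-null values; it does not increase on null rows.
key = s.notnull().cumsum()

# Keep the key only at the null positions: one distinct value per run.
gap_key = key[s.isnull()]
print(gap_key.tolist())   # [1, 2, 2]

# The size of each group is the length of the corresponding gap.
sizes = gap_key.groupby(gap_key).size()
print(sizes.tolist())     # [1, 2]
```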