How to calculate quarter differences and add missing quarters with a count in python pandas

Date: 2018-07-16 09:07:24

Tags: python pandas

I have a dataframe like the one below. For each Id I have to find the quarters that are missing between the existing ones and count how many consecutive quarters are missing:

year      Data    Id
2019Q4    57170   A
2019Q3    55150   A
2019Q2    51109   A
2019Q1    51109   A
2018Q1    57170   B
2018Q4    55150   B
2017Q4    51109   C
2017Q2    51109   C
2017Q1    51109   C

The expected output is one row per gap, with the start quarter, end quarter and count of missing quarters for each Id:

Id   start     end       count

B    2018Q2    2018Q3    2
C    2017Q3    2017Q3    1

How can I achieve this with python pandas?

1 Answer:

Answer 0 (score: 0)

Use:

#changed data for a more general solution - multiple missing years per group
print (df)
   year   Data Id
0  2015  57170  A
1  2016  55150  A
2  2019  51109  A
3  2023  51109  A
4  2000  47740  B
5  2002  44563  B
6  2003  43643  C
7  2004  42050  C
8  2007  37312  C
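
The answer never shows how this yearly sample is built; a minimal reconstruction of it (values copied from the printed frame, with the imports the snippets below rely on) could be:

import pandas as pd
import numpy as np

#rebuild the yearly sample frame shown above
df = pd.DataFrame({'year': [2015, 2016, 2019, 2023, 2000, 2002, 2003, 2004, 2007],
                   'Data': [57170, 55150, 51109, 51109, 47740, 44563, 43643, 42050, 37312],
                   'Id':   ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})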

#add rows for the missing years in each group by reindexing over the full year range
df1 = (df.set_index('year')
       .groupby('Id')['Id']
       .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
       .reset_index(name='val'))
print (df1)
   Id  year  val
0   A  2015    A
1   A  2016    A
2   A  2017  NaN
3   A  2018  NaN
4   A  2019    A
5   A  2020  NaN
6   A  2021  NaN
7   A  2022  NaN
8   A  2023    A
9   B  2000    B
10  B  2001  NaN
11  B  2002    B
12  C  2003    C
13  C  2004    C
14  C  2005  NaN
15  C  2006  NaN
16  C  2007    C

#boolean mask of non-NaN rows, stored in a variable for reuse
m = df1['val'].notnull().rename('g')
#index by the cumulative sum of the mask so each run of consecutive NaNs gets its own group key
df1.index = m.cumsum()
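
As an aside, the cumulative sum of the non-null mask is what makes every run of consecutive NaNs share a single key; a tiny standalone illustration with made-up values:

s = pd.Series(['A', None, None, 'A', None, 'A'])
m = s.notnull()
#each NaN inherits the key of the last non-NaN before it, so a run of NaNs shares one key
print (m.cumsum().tolist())
[1, 1, 1, 2, 2, 3]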

#filter only the NaN rows and aggregate first, last and count per group
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
  Id  first  last  size
0  A   2017  2018     2
1  A   2020  2022     3
2  B   2001  2001     1
3  C   2005  2006     2
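
If the result should use the column names asked for in the question, a final rename (purely cosmetic, applied to the df2 above) can be appended:

df2 = df2.rename(columns={'first':'start', 'last':'end', 'size':'count'})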

EDIT:

#convert to datetimes (here the 'year' column is assumed to hold year-month strings such as '201505')
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample each group to month-start frequency; asfreq leaves NaN for the missing months
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
   Id       year  val
0   A 2015-05-01    A
1   A 2015-06-01  NaN
2   A 2015-07-01    A
3   A 2015-08-01  NaN
4   A 2015-09-01    A
5   B 2000-01-01    B
6   B 2000-02-01  NaN
7   B 2000-03-01    B
8   C 2003-01-01    C
9   C 2003-02-01    C
10  C 2003-03-01  NaN
11  C 2003-04-01  NaN
12  C 2003-05-01    C

m = df1['val'].notnull().rename('g')
#index by the cumulative sum of the mask so each run of consecutive NaNs gets its own group key
df1.index = m.cumsum()

#filter only the NaN rows and aggregate first, last and count per group
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
  Id      first       last  size
0  A 2015-06-01 2015-06-01     1
1  A 2015-08-01 2015-08-01     1
2  B 2000-02-01 2000-02-01     1
3  C 2003-03-01 2003-04-01     2
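
The question itself uses quarter labels rather than plain years or months; the same approach carries over to quarters via a PeriodIndex. The following is only a sketch under that assumption - the sample frame is built from the question's Ids B and C, and the renamed output columns (start/end/count) are my own choice:

import pandas as pd

#quarterly sample mirroring the question's data for Ids B and C
df = pd.DataFrame({'year': ['2018Q1', '2018Q4', '2017Q4', '2017Q2', '2017Q1'],
                   'Data': [57170, 55150, 51109, 51109, 51109],
                   'Id':   ['B', 'B', 'C', 'C', 'C']})

#parse the quarter strings into quarterly Periods
df['year'] = pd.PeriodIndex(df['year'], freq='Q')

#reindex each Id over its full quarter range so the missing quarters become NaN
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(pd.period_range(x.index.min(),
                                                    x.index.max(), freq='Q')))
         .reset_index(name='val'))

#same trick as above: consecutive NaNs share a cumulative-sum key
m = df1['val'].notnull().rename('g')
df1.index = m.cumsum()

#aggregate each gap to its first quarter, last quarter and length
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index()
                     .rename(columns={'first':'start', 'last':'end', 'size':'count'}))
print (df2)

This should yield one row per gap - B missing 2018Q2 through 2018Q3 (count 2) and C missing 2017Q3 (count 1) - matching the expected output in the question.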