How to add missing months to monthly data in pandas

Time: 2018-07-26 15:11:30

Tags: pandas date dataframe

I have the following DataFrame:

Date                        
2002-01-01  10.0   NaN   NaN
2002-05-01   NaN  30.0  40.0
2002-07-01   NaN   NaN  50.0

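I would like to fill in the missing months with zeros. I managed to do this, but only by adding the entire range of missing days rather than just the missing months. The relevant part of the code is marked with rows of # characters: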

import datetime

import numpy as np
import pandas as pd


def createSeriesOfCompanies(df):
    listOfCompanies=list(set(df['Company']))
    dfSeries=df.pivot(index='Date', columns='Company', values='var1')
    # Here I include the missing dates
    #######################################################
    initialDate=dfSeries.index[0]
    endDate=dfSeries.index[-1]
    idx = pd.date_range(initialDate, endDate)
    dfSeries.index = pd.DatetimeIndex(dfSeries.index)
    dfSeries = dfSeries.reindex(idx, fill_value=0)
    ########################################################
    # Here it finishes the procedure
    return dfSeries, listOfCompanies


def creatingDataFrame():

    dateList=[]
    dateList.append(datetime.date(2002,1,1))
    dateList.append(datetime.date(2002,7,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,5,1))
    dateList.append(datetime.date(2002,7,1))
    raw_data = {'Date': dateList,            
                'Company': ['A', 'B', 'B', 'C' , 'C'],                
                'var1': [10, 20, 30, 40 , 50]}

    df = pd.DataFrame(raw_data, columns = ['Date','Company', 'var1'])
    df.loc[1, 'var1'] = np.nan
    return df


if __name__=="__main__":
    df=creatingDataFrame()
    print(df)


    dfSeries,listOfCompanies=createSeriesOfCompanies(df)        

I would like to get

Date                        
2002-01-01  10.0   NaN   NaN
2002-02-01  0      0     0
2002-03-01  0      0     0
2002-04-01  0      0     0
2002-05-01   NaN  30.0  40.0
2002-06-01  0      0     0
2002-07-01   NaN   NaN  50.0

But I get

Company        A    B     C
2002-01-01  10.0  NaN   NaN
2002-01-02   0.0  0.0   0.0
2002-01-03   0.0  0.0   0.0
2002-01-04   0.0  0.0   0.0
2002-01-05   0.0  0.0   0.0
2002-01-06   0.0  0.0   0.0
2002-01-07   0.0  0.0   0.0
2002-01-08   0.0  0.0   0.0
2002-01-09   0.0  0.0   0.0
2002-01-10   0.0  0.0   0.0
2002-01-11   0.0  0.0   0.0
2002-01-12   0.0  0.0   0.0
2002-01-13   0.0  0.0   0.0
2002-01-14   0.0  0.0   0.0
2002-01-15   0.0  0.0   0.0
2002-01-16   0.0  0.0   0.0
2002-01-17   0.0  0.0   0.0
2002-01-18   0.0  0.0   0.0
2002-01-19   0.0  0.0   0.0
2002-01-20   0.0  0.0   0.0
2002-01-21   0.0  0.0   0.0
2002-01-22   0.0  0.0   0.0
2002-01-23   0.0  0.0   0.0
2002-01-24   0.0  0.0   0.0
2002-01-25   0.0  0.0   0.0
2002-01-26   0.0  0.0   0.0
2002-01-27   0.0  0.0   0.0
2002-01-28   0.0  0.0   0.0
2002-01-29   0.0  0.0   0.0
2002-01-30   0.0  0.0   0.0

...

How can I fix this?

2 answers:

Answer 0 (score: 2)

You can use reindex. Given that the dates are the index:

df.index = pd.to_datetime(df.index)
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq = 'MS'))

            A    B      C
2002-01-01  10.0 NaN    NaN
2002-02-01  NaN  NaN    NaN
2002-03-01  NaN  NaN    NaN
2002-04-01  NaN  NaN    NaN
2002-05-01  NaN  30.0   40.0
2002-06-01  NaN  NaN    NaN
2002-07-01  NaN  NaN    50.0
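
The output above leaves the new months as NaN, while the question asks for zeros. A minimal sketch of one way to get there, assuming the pivoted frame from the question is rebuilt by hand (the variable names `monthly_index` and `filled` are illustrative, not from the post): `reindex` accepts a `fill_value`, which is used only for rows that reindexing creates, so NaN values already present in the data are kept. It also shows why the question's code produced daily rows: `pd.date_range` defaults to a daily frequency when `freq` is not given.

```python
import numpy as np
import pandas as pd

# Pivoted frame rebuilt from the values shown in the question.
df = pd.DataFrame(
    {
        "A": [10.0, np.nan, np.nan],
        "B": [np.nan, 30.0, np.nan],
        "C": [np.nan, 40.0, 50.0],
    },
    index=pd.to_datetime(["2002-01-01", "2002-05-01", "2002-07-01"]),
)

# Without freq, date_range yields one entry per day (the question's problem).
daily_index = pd.date_range(df.index.min(), df.index.max())

# With freq="MS", it yields one entry per month start.
monthly_index = pd.date_range(df.index.min(), df.index.max(), freq="MS")

# fill_value applies only to newly created rows; existing NaN values stay NaN.
filled = df.reindex(monthly_index, fill_value=0)
print(filled)
```

If the pre-existing NaN values should become zeros as well, a `fillna(0)` call on the result would handle that, at the cost of no longer distinguishing "month was missing" from "value was missing".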

Answer 1 (score: 2)

Use asfreq with MS (month start):

df=creatingDataFrame()

df = df.pivot(index='Date', columns='Company', values='var1').asfreq('MS', fill_value=0)
print (df)
Company        A     B     C
Date                        
2002-01-01  10.0   NaN   NaN
2002-02-01   0.0   0.0   0.0
2002-03-01   0.0   0.0   0.0
2002-04-01   0.0   0.0   0.0
2002-05-01   NaN  30.0  40.0
2002-06-01   0.0   0.0   0.0
2002-07-01   NaN   NaN  50.0
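
For reference, the same approach can be sketched as a self-contained script. One assumption is added here that is not in the answer: the `Date` column is converted with `pd.to_datetime` before pivoting, because the question builds it from `datetime.date` objects and `asfreq` is designed for a datetime-like index. The names `raw`, `wide` and `monthly` are illustrative.

```python
import numpy as np
import pandas as pd

# Long-format data equivalent to creatingDataFrame() in the question,
# with Date stored as datetime64 instead of datetime.date objects.
raw = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2002-01-01", "2002-07-01", "2002-05-01", "2002-05-01", "2002-07-01"]
        ),
        "Company": ["A", "B", "B", "C", "C"],
        "var1": [10, np.nan, 30, 40, 50],
    }
)

# One column per company, indexed by date.
wide = raw.pivot(index="Date", columns="Company", values="var1")

# Insert the missing month starts; only the inserted rows receive 0.
monthly = wide.asfreq("MS", fill_value=0)
print(monthly)
```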