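Fill in past and future data from partial data in Python

Time: 2018-11-13 17:32:31

Tags: python python-3.x pandas dataframe missing-data

I have a cumulative sum of my data running from 1987 (the first year in the sample below) to 2016, and it is currently in this format: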

State   Year    Month   Value
TN      1987    1       24410.0
TN      1987    2       24410.0
TN      1987    3       24410.0
TN      1987    4       24410.0
.
.
TN      1996    1       24410.0
TN      1996    2       24410.0
TN      1996    3       24410.0
TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
TN      1996    7       37109.0
TN      1996    8       37109.0
TN      1996    9       37109.0
TN      1996    10      37109.0
TN      1996    11      37109.0
TN      1996    12      37109.0
TN      2016    1       49808.0
TN      2016    2       49808.0

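The data actually skips from 1996 to 2016 (for TN; the gaps differ from state to state). I need a general way to fill every gap in the data, since some years are missing entirely (for example 2010-2015), and I also want to extend the result so that the output runs through 2018.

I want each missing value to be filled with the previous value, giving output like this: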

TN      1996    4       24410.0
TN      1996    5       37109.0
TN      1996    6       37109.0
.
.
TN      2010    1       37109.0
TN      2010    2       37109.0
TN      2010    3       37109.0
.
.
TN      2016    1       37109.0
TN      2016    2       37109.0
.
.
TN      2016    11      49808.0
TN      2016    12      49808.0
.
.
TN      2017    1       49808.0
TN      2017    2       49808.0
TN      2017    3       49808.0
TN      2017    4       49808.0
.
.
TN      2018    1       49808.0
TN      2018    2       49808.0

2 Answers:

Answer 0 (score: 0)

You can create a dataframe containing all the months and then merge it with your data:

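import pandas as pd

# pandas >= 1.4 spells this option inclusive=; older versions used closed=.
dates = pd.date_range(start='1/1/%d' % df['Year'].min(),
                      end='1/08/%d' % df['Year'].max(),
                      freq='MS', inclusive='left')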

>> dates

DatetimeIndex(['1987-02-01', '1987-03-01', '1987-04-01', '1987-05-01',
               '1987-06-01', '1987-07-01', '1987-08-01', '1987-09-01',
               '1987-10-01', '1987-11-01',
               ...
               '2015-04-01', '2015-05-01', '2015-06-01', '2015-07-01',
               '2015-08-01', '2015-09-01', '2015-10-01', '2015-11-01',
               '2015-12-01', '2016-01-01'],
              dtype='datetime64[ns]', length=348, freq='MS')

Then you can create a dataframe of all the months:

# date_range() already returns the months in chronological order.
all_months = pd.DataFrame({'Year': dates.year, 'Month': dates.month})

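Then merge it with the original dataframe, sort chronologically, and forward fill. The sort matters: the output shown below comes from the merge without it, where the months missing from df land after the matched rows and therefore pick up the last value (49808.0) instead of the preceding month's value.

df.merge(all_months, how='right').sort_values(['Year', 'Month']).ffill()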

    State    Year  Month    Value
0      TN  1987.0    1.0  24410.0
1      TN  1987.0    2.0  24410.0
2      TN  1987.0    3.0  24410.0
3      TN  1987.0    4.0  24410.0
4      TN  1996.0    1.0  24410.0
5      TN  1996.0    2.0  24410.0
6      TN  1996.0    3.0  24410.0
7      TN  1996.0    4.0  24410.0
8      TN  1996.0    5.0  37109.0
9      TN  1996.0    6.0  37109.0
10     TN  1996.0    7.0  37109.0
11     TN  1996.0    8.0  37109.0
12     TN  1996.0    9.0  37109.0
13     TN  1996.0   10.0  37109.0
14     TN  1996.0   11.0  37109.0
15     TN  1996.0   12.0  37109.0
16     TN  2016.0    1.0  49808.0
17     TN  1987.0    5.0  49808.0
18     TN  1987.0    6.0  49808.0
19     TN  1987.0    7.0  49808.0
20     TN  1987.0    8.0  49808.0
21     TN  1987.0    9.0  49808.0
22     TN  1987.0   10.0  49808.0
23     TN  1987.0   11.0  49808.0
24     TN  1987.0   12.0  49808.0
25     TN  1988.0    1.0  49808.0
26     TN  1988.0    2.0  49808.0
27     TN  1988.0    3.0  49808.0
28     TN  1988.0    4.0  49808.0
29     TN  1988.0    5.0  49808.0
..    ...     ...    ...      ...
319    TN  2013.0    7.0  49808.0
320    TN  2013.0    8.0  49808.0
321    TN  2013.0    9.0  49808.0
322    TN  2013.0   10.0  49808.0
323    TN  2013.0   11.0  49808.0
324    TN  2013.0   12.0  49808.0
325    TN  2014.0    1.0  49808.0
326    TN  2014.0    2.0  49808.0
327    TN  2014.0    3.0  49808.0
328    TN  2014.0    4.0  49808.0
329    TN  2014.0    5.0  49808.0
330    TN  2014.0    6.0  49808.0
331    TN  2014.0    7.0  49808.0
332    TN  2014.0    8.0  49808.0
333    TN  2014.0    9.0  49808.0
334    TN  2014.0   10.0  49808.0
335    TN  2014.0   11.0  49808.0
336    TN  2014.0   12.0  49808.0
337    TN  2015.0    1.0  49808.0
338    TN  2015.0    2.0  49808.0
339    TN  2015.0    3.0  49808.0
340    TN  2015.0    4.0  49808.0
341    TN  2015.0    5.0  49808.0
342    TN  2015.0    6.0  49808.0
343    TN  2015.0    7.0  49808.0
344    TN  2015.0    8.0  49808.0
345    TN  2015.0    9.0  49808.0
346    TN  2015.0   10.0  49808.0
347    TN  2015.0   11.0  49808.0
348    TN  2015.0   12.0  49808.0
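
The merge approach can be sketched for several states at once, extended through 2018 as the question asks. This is only an illustration under stated assumptions: the sample rows (including the KY values), the END_YEAR constant, and the fill_months helper are hypothetical and not part of the original answer. Each state is filled on its own monthly grid, so values never leak from one state into the next.

```python
import pandas as pd

# Hypothetical sample in the question's layout (KY rows are invented).
df = pd.DataFrame({
    'State': ['TN', 'TN', 'TN', 'KY', 'KY'],
    'Year':  [1996, 1996, 2016, 2015, 2017],
    'Month': [4, 5, 1, 11, 2],
    'Value': [24410.0, 37109.0, 49808.0, 100.0, 250.0],
})

END_YEAR = 2018  # assumption: every state is filled through December 2018


def fill_months(state, group):
    """Put one state's rows on a complete monthly grid and forward fill."""
    # Start the grid at the state's first observed month.
    stamps = pd.to_datetime(group[['Year', 'Month']].assign(Day=1))
    dates = pd.date_range(start=stamps.min(),
                          end='%d-12-01' % END_YEAR, freq='MS')
    grid = pd.DataFrame({'Year': dates.year, 'Month': dates.month})
    # A left join on the grid keeps the grid's chronological order,
    # so the forward fill always copies from the preceding month.
    merged = grid.merge(group[['Year', 'Month', 'Value']],
                        on=['Year', 'Month'], how='left')
    merged['Value'] = merged['Value'].ffill()
    merged.insert(0, 'State', state)
    return merged


filled = pd.concat(
    [fill_months(state, group) for state, group in df.groupby('State')],
    ignore_index=True,
)
```

Because the grid is the left side of the join, no extra sort is needed before ffill(), and the Year and Month columns stay integers instead of becoming floats.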

Using pandas.resample

Another solution is to index by date and then resample on that index:

df['Day'] = 1

df1 = df.assign(date= lambda x:pd.to_datetime(x[['Year', 'Month', 'Day']])).set_index('date')

>> df1

           State    Year  Month    Value  Day
date                                         
1987-01-01    TN  1987.0    1.0  24410.0    1
1987-02-01    TN  1987.0    2.0  24410.0    1
1987-03-01    TN  1987.0    3.0  24410.0    1
1987-04-01    TN  1987.0    4.0  24410.0    1
1996-01-01    TN  1996.0    1.0  24410.0    1
1996-02-01    TN  1996.0    2.0  24410.0    1
1996-03-01    TN  1996.0    3.0  24410.0    1
1996-04-01    TN  1996.0    4.0  24410.0    1
1996-05-01    TN  1996.0    5.0  37109.0    1
1996-06-01    TN  1996.0    6.0  37109.0    1
1996-07-01    TN  1996.0    7.0  37109.0    1
1996-08-01    TN  1996.0    8.0  37109.0    1
1996-09-01    TN  1996.0    9.0  37109.0    1
1996-10-01    TN  1996.0   10.0  37109.0    1
1996-11-01    TN  1996.0   11.0  37109.0    1
1996-12-01    TN  1996.0   12.0  37109.0    1
2016-01-01    TN  2016.0    1.0  49808.0    1
2016-02-01    TN  2016.0    2.0  49808.0    1

Then you can resample by month as follows:

    res = df1.resample('M').first().ffill()

    >> res 

               State    Year  Month    Value  Day
    date                                         
    1987-01-31    TN  1987.0    1.0  24410.0  1.0
    1987-02-28    TN  1987.0    2.0  24410.0  1.0
    1987-03-31    TN  1987.0    3.0  24410.0  1.0
    1987-04-30    TN  1987.0    4.0  24410.0  1.0
    1987-05-31    TN  1987.0    4.0  24410.0  1.0
    1987-06-30    TN  1987.0    4.0  24410.0  1.0
    1987-07-31    TN  1987.0    4.0  24410.0  1.0
    1987-08-31    TN  1987.0    4.0  24410.0  1.0
    1987-09-30    TN  1987.0    4.0  24410.0  1.0
    1987-10-31    TN  1987.0    4.0  24410.0  1.0
    1987-11-30    TN  1987.0    4.0  24410.0  1.0
    1987-12-31    TN  1987.0    4.0  24410.0  1.0
    1988-01-31    TN  1987.0    4.0  24410.0  1.0
    1988-02-29    TN  1987.0    4.0  24410.0  1.0
    1988-03-31    TN  1987.0    4.0  24410.0  1.0
    1988-04-30    TN  1987.0    4.0  24410.0  1.0
    1988-05-31    TN  1987.0    4.0  24410.0  1.0
    1988-06-30    TN  1987.0    4.0  24410.0  1.0
    1988-07-31    TN  1987.0    4.0  24410.0  1.0
    1988-08-31    TN  1987.0    4.0  24410.0  1.0
    1988-09-30    TN  1987.0    4.0  24410.0  1.0
    1988-10-31    TN  1987.0    4.0  24410.0  1.0
    1988-11-30    TN  1987.0    4.0  24410.0  1.0
    1988-12-31    TN  1987.0    4.0  24410.0  1.0
    1989-01-31    TN  1987.0    4.0  24410.0  1.0
    1989-02-28    TN  1987.0    4.0  24410.0  1.0
    1989-03-31    TN  1987.0    4.0  24410.0  1.0
    1989-04-30    TN  1987.0    4.0  24410.0  1.0
    1989-05-31    TN  1987.0    4.0  24410.0  1.0
    1989-06-30    TN  1987.0    4.0  24410.0  1.0
    ...          ...     ...    ...      ...  ...
    2013-09-30    TN  1996.0   12.0  37109.0  1.0
    2013-10-31    TN  1996.0   12.0  37109.0  1.0
    2013-11-30    TN  1996.0   12.0  37109.0  1.0
    2013-12-31    TN  1996.0   12.0  37109.0  1.0
    2014-01-31    TN  1996.0   12.0  37109.0  1.0
    2014-02-28    TN  1996.0   12.0  37109.0  1.0
    2014-03-31    TN  1996.0   12.0  37109.0  1.0
    2014-04-30    TN  1996.0   12.0  37109.0  1.0
    2014-05-31    TN  1996.0   12.0  37109.0  1.0
    2014-06-30    TN  1996.0   12.0  37109.0  1.0
    2014-07-31    TN  1996.0   12.0  37109.0  1.0
    2014-08-31    TN  1996.0   12.0  37109.0  1.0
    2014-09-30    TN  1996.0   12.0  37109.0  1.0
    2014-10-31    TN  1996.0   12.0  37109.0  1.0
    2014-11-30    TN  1996.0   12.0  37109.0  1.0
    2014-12-31    TN  1996.0   12.0  37109.0  1.0
    2015-01-31    TN  1996.0   12.0  37109.0  1.0
    2015-02-28    TN  1996.0   12.0  37109.0  1.0
    2015-03-31    TN  1996.0   12.0  37109.0  1.0
    2015-04-30    TN  1996.0   12.0  37109.0  1.0
    2015-05-31    TN  1996.0   12.0  37109.0  1.0
    2015-06-30    TN  1996.0   12.0  37109.0  1.0
    2015-07-31    TN  1996.0   12.0  37109.0  1.0
    2015-08-31    TN  1996.0   12.0  37109.0  1.0
    2015-09-30    TN  1996.0   12.0  37109.0  1.0
    2015-10-31    TN  1996.0   12.0  37109.0  1.0
    2015-11-30    TN  1996.0   12.0  37109.0  1.0
    2015-12-31    TN  1996.0   12.0  37109.0  1.0
    2016-01-31    TN  2016.0    1.0  49808.0  1.0
    2016-02-29    TN  2016.0    2.0  49808.0  1.0

You can recover the original structure as follows:

>> res.reset_index(drop=True).drop(['Day'], axis=1).head(9)

        State    Year  Month    Value
    0      TN  1987.0    1.0  24410.0
    1      TN  1987.0    2.0  24410.0
    2      TN  1987.0    3.0  24410.0
    3      TN  1987.0    4.0  24410.0
    4      TN  1987.0    4.0  24410.0
    5      TN  1987.0    4.0  24410.0
    6      TN  1987.0    4.0  24410.0
    7      TN  1987.0    4.0  24410.0
    8      TN  1987.0    4.0  24410.0
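
In the resampled output above, the Year and Month columns are forward filled along with Value, so filled rows repeat the last observed month (for example 1987 / 4). A minimal sketch of one way around this, assuming a single state and a hypothetical three-row sample, is to resample only the Value column and derive Year and Month from the resampled index:

```python
import pandas as pd

# Hypothetical single-state sample in the question's layout.
df = pd.DataFrame({
    'State': ['TN', 'TN', 'TN'],
    'Year':  [1996, 1996, 2016],
    'Month': [4, 5, 1],
    'Value': [24410.0, 37109.0, 49808.0],
})

# Index the values by the first day of each month.
stamps = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))
series = df.set_index(stamps)['Value']

# 'MS' labels each bin by month start; empty months become NaN, then ffill.
monthly = series.resample('MS').first().ffill()

res = pd.DataFrame({
    'State': 'TN',
    'Year': monthly.index.year,    # taken from the index, so never stale
    'Month': monthly.index.month,
    'Value': monthly.to_numpy(),
})
```

For several states, the same steps could be applied per group (for instance inside a loop over df.groupby('State')) so that the forward fill does not cross state boundaries.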

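Answer 1 (score: 0)

How about {{1}}? It fills in values according to different methods.

See the "Interpolation" section here: Pandas interpolate() backwards in dataframe

And some existing examples posted earlier: {{3}}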
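
Assuming the suggestion refers to pandas' interpolate(), the sketch below contrasts it with a forward fill on a small hypothetical series. Linear interpolation ramps between the known points, whereas the output requested in the question repeats the previous value, which is what ffill() does.

```python
import pandas as pd

# Hypothetical monthly values with two missing months in between.
idx = pd.date_range('1996-03-01', periods=4, freq='MS')
values = pd.Series([24410.0, None, None, 37109.0], index=idx)

linear = values.interpolate(method='linear')  # ramps between known points
stepped = values.ffill()                      # repeats the last known value
```

For cumulative totals that change in steps, the stepped result matches the desired output in the question; the linear result would invent intermediate totals that were never observed.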