How to reindex a pandas DataFrame by a date range without dropping values

Asked: 2015-12-23 15:59:08

Tags: python python-2.7 pandas

Background

I downloaded the following DataFrame using pyodbc; the dates range from 1999 to 2015:

CEISales.head(10)
Out[194]: 
   Order_DateC   RegionC     SalesC
0  2014-01-30  Domestic    3530.00
1  2011-10-11  Domestic     136.00
2  1999-01-13  Domestic      30.00
3  1999-01-13  Domestic   55615.00
4  1999-01-13  Domestic     440.00
5  1999-01-13  Domestic      94.00
6  1999-01-05  Domestic     612.00
7  1999-01-14  Domestic    1067.00
8  1999-01-14  Domestic   26345.05
9  1999-01-15  Domestic  161858.72

I then filtered the data to all dates after 2010-01-01 and sorted it by date in ascending order:

CEIFilter = CEISales[CEISales['Order_DateC'] > '2010-01-01']

CEITest = CEIFilter.sort_values('Order_DateC')

CEITest.head(5)
Out[199]: 
      Order_DateC   RegionC   SalesC
18156  2010-01-04   Foreign    450.0
18155  2010-01-04  Domestic   1990.4
18154  2010-01-04  Domestic  37477.0
18152  2010-01-04  Domestic      0.0
18153  2010-01-04  Domestic    783.0

I then created a date index covering 2010-01-01 through today with pandas' date_range function:

date_index = pd.date_range(start='2010-01-01', end='2015-12-23' , freq='d')

Then I reindexed the DataFrame:

CEIFinal= CEITest.reindex(date_index)

My problem is that when I reindex the DataFrame, all of the data is dropped:

CEIFinal.head(5)
Out[206]: 
            Order_DateC RegionC  SalesC
2010-01-01         NaT     NaN     NaN
2010-01-02         NaT     NaN     NaN
2010-01-03         NaT     NaN     NaN
2010-01-04         NaT     NaN     NaN
2010-01-05         NaT     NaN     NaN

From the original filtered DataFrame, you can see that there are transactions on 2010-01-04:

CEITest[CEITest['Order_DateC'] == '2010-01-04']
Out[210]: 
      Order_DateC   RegionC   SalesC
18156  2010-01-04   Foreign    450.0
18155  2010-01-04  Domestic   1990.4
18154  2010-01-04  Domestic  37477.0
18152  2010-01-04  Domestic      0.0
18153  2010-01-04  Domestic    783.0

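Question

How can I reindex this DataFrame with this date range and keep all of the original values? I am trying to create a common index across several DataFrames from different databases so that I can add them together into one aggregate DataFrame. Any help is much appreciated. Thanks!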

2 answers:

Answer 0 (score: 1)

You are reindexing with a DatetimeIndex, but the DataFrame's index is not a DatetimeIndex (it is the integer row labels):

      Order_DateC   RegionC   SalesC
18156  2010-01-04   Foreign    450.0
18155  2010-01-04  Domestic   1990.4
18154  2010-01-04  Domestic  37477.0
18152  2010-01-04  Domestic      0.0
18153  2010-01-04  Domestic    783.0

Hence the NaNs and NaTs.
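The mismatch described above can be reproduced with a small sketch. The sample rows, the shortened date range, and the variable names `sales` and `misaligned` are assumptions made for illustration, not the asker's actual data:

```python
import pandas as pd

# Hypothetical sample mirroring a few rows from the question; the index holds
# integer row labels, not dates.
sales = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04", "2010-01-05"]),
        "RegionC": ["Foreign", "Domestic", "Domestic"],
        "SalesC": [450.0, 1990.4, 783.0],
    },
    index=[18156, 18155, 18153],
)

date_index = pd.date_range(start="2010-01-01", end="2010-01-07", freq="D")

# reindex looks up labels in the index, not values in the Order_DateC column.
# No integer label equals a date, so every requested row comes back empty.
misaligned = sales.reindex(date_index)
print(misaligned.isnull().all().all())
```

Every cell in `misaligned` is missing, which matches the all-NaN/NaT frame shown in the question.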

Perhaps you want to make Order_DateC the index:

df = df.set_index("Order_DateC")

Then resample.

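Note that reindex cannot keep several rows for the same date: when the index contains duplicate dates, pandas raises an error on reindex, so aggregate the duplicates first (for example with resample).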
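A minimal sketch of the set_index plus resample route, assuming daily totals of SalesC are what is wanted (the answer does not say which aggregation to use). The sample rows and the names `daily` and `daily_full` are hypothetical:

```python
import pandas as pd

# Hypothetical sample with two orders on the same day.
sales = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04", "2010-01-06"]),
        "RegionC": ["Foreign", "Domestic", "Domestic"],
        "SalesC": [450.0, 1990.4, 783.0],
    }
)

# Make the dates the index, then sum SalesC per calendar day. Rows that share
# a date are combined into one total instead of being dropped.
daily = sales.set_index("Order_DateC")["SalesC"].resample("D").sum()

# After resampling the dates are unique, so reindexing onto the wider common
# range is safe; days outside the data are filled with 0.0.
date_index = pd.date_range(start="2010-01-01", end="2010-01-08", freq="D")
daily_full = daily.reindex(date_index, fill_value=0.0)
print(daily_full)
```

If the individual transactions must be kept rather than summed, a different approach (such as merging on the date column) would be needed; this sketch only covers the aggregated case.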

Answer 1 (score: 1)

I think you need to set the column Order_DateC as the index before calling reindex:

CEITest = CEITest.set_index('Order_DateC')

Finally, you can check the non-null values with notnull and any:

print(CEIFinal[CEIFinal.notnull().any(axis=1)])

             RegionC  SalesC
2011-10-11  Domestic     136
2014-01-30  Domestic    3530

All together:

print(CEISales)
  Order_DateC   RegionC     SalesC
0  2014-01-30  Domestic    3530.00
1  2011-10-11  Domestic     136.00
2  1999-01-13  Domestic      30.00
3  1999-01-13  Domestic   55615.00
4  1999-01-13  Domestic     440.00
5  1999-01-13  Domestic      94.00
6  1999-01-05  Domestic     612.00
7  1999-01-14  Domestic    1067.00
8  1999-01-14  Domestic   26345.05
9  1999-01-15  Domestic  161858.72

CEIFilter = CEISales[CEISales['Order_DateC'] > '2010-01-01']
CEITest = CEIFilter.sort_values('Order_DateC')
print(CEITest)
  Order_DateC   RegionC  SalesC
1  2011-10-11  Domestic     136
0  2014-01-30  Domestic    3530

# set Order_DateC as the index so the frame has a DatetimeIndex
CEITest = CEITest.set_index('Order_DateC')
print(CEITest)
              RegionC  SalesC
Order_DateC                  
2011-10-11   Domestic     136
2014-01-30   Domestic    3530

date_index = pd.date_range(start='2010-01-01', end='2015-12-23' , freq='d')
CEIFinal= CEITest.reindex(date_index)

print(CEIFinal.head())
           RegionC  SalesC
2010-01-01     NaN     NaN
2010-01-02     NaN     NaN
2010-01-03     NaN     NaN
2010-01-04     NaN     NaN
2010-01-05     NaN     NaN

There can be many NaT and NaN rows, so check the data:

print(CEIFinal[CEIFinal.notnull().any(axis=1)])
             RegionC  SalesC
2011-10-11  Domestic     136
2014-01-30  Domestic    3530

Finally, you can set the index name and call reset_index; the index name becomes the name of the new column:

CEIFinal.index.name = 'CEIFinal'
CEIFinal = CEIFinal.reset_index()
print(CEIFinal.head())
   CEIFinal RegionC  SalesC
0 2010-01-01     NaN     NaN
1 2010-01-02     NaN     NaN
2 2010-01-03     NaN     NaN
3 2010-01-04     NaN     NaN
4 2010-01-05     NaN     NaN
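For the goal stated in the question (adding several DataFrames together on a common date index), one possible sketch follows. The helper `daily_sales`, the second frame `other`, and the short date range are all assumptions made for illustration. It sums per date first, because reindex needs unique dates:

```python
import pandas as pd

date_index = pd.date_range(start="2010-01-01", end="2010-01-05", freq="D")


def daily_sales(frame, date_col, value_col):
    """Hypothetical helper: total value_col per date, aligned to date_index."""
    per_day = frame.groupby(date_col)[value_col].sum()
    # Dates are unique after the groupby, so reindex will not raise.
    return per_day.reindex(date_index, fill_value=0.0)


# Two hypothetical sources; the first has duplicate dates.
cei = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04"]),
        "SalesC": [450.0, 783.0],
    }
)
other = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-02", "2010-01-04"]),
        "SalesC": [100.0, 50.0],
    }
)

# Both series share date_index, so they add element by element.
total = daily_sales(cei, "Order_DateC", "SalesC") + daily_sales(other, "Order_DateC", "SalesC")
print(total)
```

Filling missing days with 0.0 before adding avoids NaN totals on days where only one source has sales; whether 0.0 is the right fill depends on what a missing day means in each database.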