Background
I downloaded the following DataFrame using pyodbc; the dates fall between 1999 and 2015:
CEISales.head(10)
Out[194]:
Order_DateC RegionC SalesC
0 2014-01-30 Domestic 3530.00
1 2011-10-11 Domestic 136.00
2 1999-01-13 Domestic 30.00
3 1999-01-13 Domestic 55615.00
4 1999-01-13 Domestic 440.00
5 1999-01-13 Domestic 94.00
6 1999-01-05 Domestic 612.00
7 1999-01-14 Domestic 1067.00
8 1999-01-14 Domestic 26345.05
9 1999-01-15 Domestic 161858.72
I then filtered the data to keep only dates after 2010-01-01 and sorted it by date in ascending order:
CEIFilter = CEISales[CEISales['Order_DateC'] > '2010-01-01']
CEITest = CEIFilter.sort_values('Order_DateC')  # DataFrame.sort was removed in later pandas versions
CEITest.head(5)
Out[199]:
Order_DateC RegionC SalesC
18156 2010-01-04 Foreign 450.0
18155 2010-01-04 Domestic 1990.4
18154 2010-01-04 Domestic 37477.0
18152 2010-01-04 Domestic 0.0
18153 2010-01-04 Domestic 783.0
Then I created a date index running from 2010-01-01 to today with pandas' date_range function:
date_index = pd.date_range(start='2010-01-01', end='2015-12-23' , freq='d')
and reindexed the DataFrame:
CEIFinal= CEITest.reindex(date_index)
My problem is that when I reindex the DataFrame, all the data disappears:
CEIFinal.head(5)
Out[206]:
Order_DateC RegionC SalesC
2010-01-01 NaT NaN NaN
2010-01-02 NaT NaN NaN
2010-01-03 NaT NaN NaN
2010-01-04 NaT NaN NaN
2010-01-05 NaT NaN NaN
From the original filtered DataFrame you can see that there are transactions on 2010-01-04:
CEITest[CEITest['Order_DateC'] == '2010-01-04']
Out[210]:
Order_DateC RegionC SalesC
18156 2010-01-04 Foreign 450.0
18155 2010-01-04 Domestic 1990.4
18154 2010-01-04 Domestic 37477.0
18152 2010-01-04 Domestic 0.0
18153 2010-01-04 Domestic 783.0
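Question
How can I reindex this DataFrame with this date range and keep all the original values? I'm trying to create a common index across several DataFrames from different databases so that I can add them together into one aggregate DataFrame. Any help is much appreciated. Thanks!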
Answer 0 (score: 1)
You are reindexing with a DatetimeIndex when the existing index is not a DatetimeIndex (it still holds the integer row labels):
Order_DateC RegionC SalesC
18156 2010-01-04 Foreign 450.0
18155 2010-01-04 Domestic 1990.4
18154 2010-01-04 Domestic 37477.0
18152 2010-01-04 Domestic 0.0
18153 2010-01-04 Domestic 783.0
None of the new date labels match an existing label, hence the NaNs and NaTs.
Perhaps you want to make Order_DateC the index:
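To make the cause concrete, here is a minimal sketch with a hypothetical three-row frame (the values are invented for illustration, not taken from the real data). It shows that reindex looks up the new labels in the existing index, not in the Order_DateC column, so nothing matches:

```python
import pandas as pd

# Hypothetical miniature of the filtered frame: integer row labels,
# with the dates stored in an ordinary column.
df = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04", "2010-01-05"]),
        "SalesC": [450.0, 1990.4, 783.0],
    },
    index=[18156, 18155, 18153],
)

date_index = pd.date_range(start="2010-01-01", end="2010-01-07", freq="D")

# reindex matches against the existing labels (18156, ...), not Order_DateC.
# No Timestamp label exists in the old index, so every new row is empty.
reindexed = df.reindex(date_index)
print(reindexed.isna().all().all())
```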
df = df.set_index("Order_DateC")
and then resample.
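Reindexing cannot keep several rows for the same date: reindex needs unique index labels (recent pandas versions raise a ValueError otherwise), so duplicate dates have to be aggregated first, which is what resample does.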
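A minimal sketch of that suggestion, assuming the goal is total sales per calendar day and that days without orders should count as 0 (both are assumptions, as is the small sample frame). Note that summing SalesC this way drops the RegionC column:

```python
import pandas as pd

# Hypothetical sample with two orders on the same day.
df = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04", "2010-01-06"]),
        "RegionC": ["Foreign", "Domestic", "Domestic"],
        "SalesC": [450.0, 1990.4, 783.0],
    }
)

# Make the dates the index, then sum the sales within each daily bin.
# Days inside the data's range that have no orders come out as 0.0.
daily = df.set_index("Order_DateC")["SalesC"].resample("D").sum()

# The daily index is now unique, so it can safely be stretched to the
# full calendar; fill_value=0.0 assumes "no orders" means zero sales.
date_index = pd.date_range(start="2010-01-01", end="2010-01-07", freq="D")
daily_full = daily.reindex(date_index, fill_value=0.0)
print(daily_full)
```

If the per-region split matters, grouping by RegionC before resampling would be one way to keep it.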
Answer 1 (score: 1)
I think you need to set the column Order_DateC as the index before calling reindex:
CEITest = CEITest.set_index('Order_DateC')
CEIFinal = CEITest.reindex(date_index)
print(CEIFinal[CEIFinal.notnull().any(axis=1)])
RegionC SalesC
2011-10-11 Domestic 136
2014-01-30 Domestic 3530
All together:
print(CEISales)
Order_DateC RegionC SalesC
0 2014-01-30 Domestic 3530.00
1 2011-10-11 Domestic 136.00
2 1999-01-13 Domestic 30.00
3 1999-01-13 Domestic 55615.00
4 1999-01-13 Domestic 440.00
5 1999-01-13 Domestic 94.00
6 1999-01-05 Domestic 612.00
7 1999-01-14 Domestic 1067.00
8 1999-01-14 Domestic 26345.05
9 1999-01-15 Domestic 161858.72
CEIFilter = CEISales[CEISales['Order_DateC'] > '2010-01-01']
CEITest = CEIFilter.sort_values('Order_DateC')
print(CEITest)
Order_DateC RegionC SalesC
1 2011-10-11 Domestic 136
0 2014-01-30 Domestic 3530
# set Order_DateC as the index so the frame has a DatetimeIndex
CEITest = CEITest.set_index('Order_DateC')
print(CEITest)
RegionC SalesC
Order_DateC
2011-10-11 Domestic 136
2014-01-30 Domestic 3530
date_index = pd.date_range(start='2010-01-01', end='2015-12-23' , freq='d')
CEIFinal= CEITest.reindex(date_index)
print(CEIFinal.head())
RegionC SalesC
2010-01-01 NaN NaN
2010-01-02 NaN NaN
2010-01-03 NaN NaN
2010-01-04 NaN NaN
2010-01-05 NaN NaN
Most rows will be NaN, because most dates in the range have no order, so check the rows that do contain data:
print(CEIFinal[CEIFinal.notnull().any(axis=1)])
RegionC SalesC
2011-10-11 Domestic 136
2014-01-30 Domestic 3530
Finally, you can set the index name and call reset_index; the index becomes a column named after the index name:
CEIFinal.index.name = 'CEIFinal'
CEIFinal = CEIFinal.reset_index()
print(CEIFinal.head())
CEIFinal RegionC SalesC
0 2010-01-01 NaN NaN
1 2010-01-02 NaN NaN
2 2010-01-03 NaN NaN
3 2010-01-04 NaN NaN
4 2010-01-05 NaN NaN
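For the stated goal of adding several DataFrames together on a common index, one possible approach is sketched below. It assumes each frame has the same Order_DateC and SalesC columns and that missing days should count as 0; the helper name daily_sales and both sample frames are hypothetical:

```python
import pandas as pd


def daily_sales(frame, date_index, date_col="Order_DateC", value_col="SalesC"):
    """Sum value_col per calendar day and align the result to date_index."""
    # normalize() drops any time-of-day part so each date forms one group.
    per_day = frame.groupby(frame[date_col].dt.normalize())[value_col].sum()
    # The grouped index is unique, so reindex can align it to the calendar.
    return per_day.reindex(date_index, fill_value=0.0)


# Two hypothetical frames standing in for data from different databases.
a = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-04", "2010-01-06"]),
        "SalesC": [450.0, 783.0, 10.0],
    }
)
b = pd.DataFrame(
    {
        "Order_DateC": pd.to_datetime(["2010-01-04", "2010-01-05"]),
        "SalesC": [100.0, 5.0],
    }
)

date_index = pd.date_range(start="2010-01-01", end="2010-01-07", freq="D")

# Both series share date_index, so plain addition lines up day by day.
total = daily_sales(a, date_index) + daily_sales(b, date_index)
print(total)
```

Aggregating per day before reindexing is the design choice that matters here: it removes the duplicate dates that reindex cannot handle.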