I have a Pandas DataFrame like this:
Date Curr Amount
1/1/2015 USD 100.00
1/2/2015 USD 125.00
1/5/2015 USD 110.00
1/6/2015 USD 115.00
1/1/2015 AUD 100.00
1/2/2015 AUD 125.00
1/5/2015 AUD 110.00
1/6/2015 AUD 115.00
Desired output:
Date curr Amount
1/1/2015 usd 100.00
1/2/2015 usd 125.00
1/3/2015 usd 125.00
1/4/2015 usd 125.00
1/5/2015 usd 110.00
1/6/2015 usd 115.00
1/1/2015 aud 100.00
1/2/2015 aud 125.00
1/3/2015 aud 125.00
1/4/2015 aud 125.00
1/5/2015 aud 110.00
1/6/2015 aud 115.00
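The source data only records changes in the amount. I want to insert the missing dates and carry the previous amount forward.
In my example, the data jumps from 1/2 to 1/5. I would like new rows created for the missing dates (1/3 and 1/4), with the Amount column filled in using the 1/2 amount.
Thanks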
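Answer 0 (score: 3)
You want to do almost the same thing as in this question: How to fill the missing record of Pandas dataframe in pythonic way?
You need to build the full index and then forward-fill the gaps (the fillna method with the 'ffill' option; newer pandas versions spell this ffill()).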
import pandas
from io import StringIO

data = StringIO("""\
Date Curr Amount
1/1/2015 USD 100.00
1/2/2015 USD 125.00
1/5/2015 USD 110.00
1/6/2015 USD 115.00
1/1/2015 AUD 100.00
1/2/2015 AUD 125.00
1/5/2015 AUD 110.00
1/6/2015 AUD 115.00
""")

# whitespace-separated columns; parse the first column as dates
df = pandas.read_csv(data, sep=r'\s+', parse_dates=[0])

# every (Date, Curr) combination that should exist in the result
full_index = pandas.MultiIndex.from_product([
    pandas.date_range(start='2015-01-01', end='2015-01-08'),
    ['USD', 'AUD']
], names=['Date', 'Curr'])

df2 = (
    df.set_index(['Date', 'Curr'])
      .reindex(full_index)
      .unstack(level='Curr')          # pivot Curr into columns
      .ffill()                        # drag the last valid value into the NaNs
      .stack(level='Curr')            # put Curr back into rows
      .reset_index()                  # remove the index
      .sort_values(['Curr', 'Date'])  # sort the rows (.sort was removed from pandas)
      .reset_index(drop=True)         # set the index back to 0, 1, ... N
)
print(df2)
This gives us:
Date Curr Amount
0 2015-01-01 AUD 100
1 2015-01-02 AUD 125
2 2015-01-03 AUD 125
3 2015-01-04 AUD 125
4 2015-01-05 AUD 110
5 2015-01-06 AUD 115
6 2015-01-07 AUD 115
7 2015-01-08 AUD 115
8 2015-01-01 USD 100
9 2015-01-02 USD 125
10 2015-01-03 USD 125
11 2015-01-04 USD 125
12 2015-01-05 USD 110
13 2015-01-06 USD 115
14 2015-01-07 USD 115
15 2015-01-08 USD 115
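The date range and the currency list above are hard-coded. As a minimal sketch, assuming the same column names (Date, Curr, Amount) and a Date column that already holds datetimes, both can be derived from the data instead. The helper name fill_missing_dates is hypothetical, and it forward-fills within each currency via groupby rather than by unstacking:

```python
import pandas as pd


def fill_missing_dates(df):
    """Insert one row per missing calendar day and carry the last Amount forward.

    Hypothetical helper; assumes columns Date (datetime64), Curr and Amount.
    """
    full_index = pd.MultiIndex.from_product(
        [
            pd.date_range(df["Date"].min(), df["Date"].max(), freq="D"),
            df["Curr"].unique(),
        ],
        names=["Date", "Curr"],
    )
    return (
        df.set_index(["Date", "Curr"])
        .reindex(full_index)            # missing (Date, Curr) pairs become NaN
        .groupby(level="Curr")          # fill within each currency only
        .ffill()
        .reset_index()
        .sort_values(["Curr", "Date"])
        .reset_index(drop=True)
    )


df = pd.DataFrame({
    "Date": pd.to_datetime(["1/1/2015", "1/2/2015", "1/5/2015", "1/6/2015"] * 2),
    "Curr": ["USD"] * 4 + ["AUD"] * 4,
    "Amount": [100.0, 125.0, 110.0, 115.0] * 2,
})
filled = fill_missing_dates(df)
print(filled)
```

Because the range stops at the last date present in the data, this version yields 12 rows for the example rather than the 16 shown above.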
Answer 1 (score: 3)
A very long two-liner that should really be broken up:
# assumes `import pandas as pd` and that df.Date holds datetimes (e.g. via pd.to_datetime)
idx = pd.date_range(start=min(df.Date), end=max(df.Date), freq='D')
df2 = (pd.DataFrame(0.0, index=idx, columns=df.set_index(['Date', 'Curr']).unstack('Curr').columns)
       + df.set_index(['Date', 'Curr']).unstack('Curr')).ffill().stack()
>>> df2
Amount
Curr
2015-01-01 AUD 100
USD 100
2015-01-02 AUD 125
USD 125
2015-01-03 AUD 125
USD 125
2015-01-04 AUD 125
USD 125
2015-01-05 AUD 110
USD 110
2015-01-06 AUD 115
USD 115
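In more detail: I first create a DatetimeIndex spanning the minimum and maximum dates in the original DataFrame (the Date column needs to hold datetimes for this). I set the frequency to daily ('D'), but you may want to use a different offset frequency, such as business days ('B'):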
idx = pd.date_range(start=min(df.Date), end=max(df.Date), freq='D')
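To illustrate the frequency choice, here is a small sketch comparing a daily range with a business-day range over the example's dates, using pd.date_range. The literal start and end dates are taken from the question's data:

```python
import pandas as pd

# Calendar days versus business days over the same span
daily = pd.date_range(start="2015-01-01", end="2015-01-06", freq="D")
business = pd.date_range(start="2015-01-01", end="2015-01-06", freq="B")

print(len(daily))
print(len(business))
print(business.strftime("%Y-%m-%d").tolist())
```

January 3 and 4, 2015 fall on a weekend, so the 'B' range skips them and, for this particular data, leaves nothing to fill. Note that 'B' only excludes weekends, not public holidays, so January 1 is still included.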
Then I unstack the DataFrame so that only the dates remain in the index.
df_temp = df.set_index(['Date', 'Curr']).unstack('Curr')
>>> df_temp
Amount
Curr AUD USD
Date
1/1/2015 100 100
1/2/2015 125 125
1/5/2015 110 110
1/6/2015 115 115
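I create a temporary DataFrame indexed by my new, expanded list of dates and filled with zeros, then add df_temp to it. The addition aligns on the index, so dates present in df_temp keep their values and the missing dates become NaN:
df_temp2 = pd.DataFrame(0.0, index=idx, columns=df_temp.columns) + df_temp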
>>> df_temp2
Amount
Curr AUD USD
2015-01-01 100 100
2015-01-02 125 125
2015-01-03 NaN NaN
2015-01-04 NaN NaN
2015-01-05 110 110
2015-01-06 115 115
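The step above boils down to re-indexing df_temp onto the full date range. The sketch below, which assumes Date has been parsed as datetimes, compares a zero-frame addition with DataFrame.reindex. The reindex call is a more direct alternative, not the answer's original code:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["1/1/2015", "1/2/2015", "1/5/2015", "1/6/2015"] * 2),
    "Curr": ["USD"] * 4 + ["AUD"] * 4,
    "Amount": [100.0, 125.0, 110.0, 115.0] * 2,
})
idx = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
df_temp = df.set_index(["Date", "Curr"]).unstack("Curr")

# Zero frame on the full date range plus the sparse data (aligned on the index)
via_add = pd.DataFrame(0.0, index=idx, columns=df_temp.columns) + df_temp

# Same result, stated directly: dates absent from df_temp become NaN rows
via_reindex = df_temp.reindex(idx)

# Compare cell by cell, using -1 as a stand-in for NaN (amounts are positive)
same = (via_add.fillna(-1).to_numpy() == via_reindex.fillna(-1).to_numpy()).all()
print(via_reindex)
print(same)
```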
Finally, I forward-fill the values to remove the NaNs and stack the currencies back into rows:
>>> df_temp2.ffill().stack()
Amount
Curr
2015-01-01 AUD 100
USD 100
2015-01-02 AUD 125
USD 125
2015-01-03 AUD 125
USD 125
2015-01-04 AUD 125
USD 125
2015-01-05 AUD 110
USD 110
2015-01-06 AUD 115
USD 115
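The stacked result is indexed by (date, Curr) rather than laid out as the three flat columns from the question. Here is a sketch of one way to get back to that layout, assuming Date has been parsed as datetimes. It uses reindex for the gap-creation step, and the descending sort on Curr is only there to list USD before AUD as in the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["1/1/2015", "1/2/2015", "1/5/2015", "1/6/2015"] * 2),
    "Curr": ["USD"] * 4 + ["AUD"] * 4,
    "Amount": [100.0, 125.0, 110.0, 115.0] * 2,
})
idx = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
df_temp = df.set_index(["Date", "Curr"]).unstack("Curr")

# Expand to every calendar day, forward-fill, and move Curr back into rows
stacked = df_temp.reindex(idx).ffill().stack(level="Curr")

flat = (
    stacked.rename_axis(index=["Date", "Curr"])   # name both index levels
    .reset_index()                                # turn them into columns
    .sort_values(["Curr", "Date"], ascending=[False, True])
    .reset_index(drop=True)
)
print(flat)
```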