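I have a dataframe containing the bookings for one room (rows: booking_id, check-in date and check-out date) that I want to convert into a time series indexed by every day of the year (index: day of the year, feature: booked or not).
I have calculated the duration of each booking and reindexed the dataframe daily. Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
I tried iterating over each row with ffill, but it applies to the whole dataframe rather than to the selected rows. Any idea how I can do this?
Here is my code: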
import numpy as np
import pandas as pd
#create dataframe
data=[[1, '2019-01-01', '2019-01-02', 1],
[2, '2019-01-03', '2019-01-07', 4],
[3, '2019-01-10','2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
#create timeseries indexed on check-in date
df2 = df.set_index('check-in')
#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
This is what I get:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 NaN NaT NaN
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 NaN NaT NaN
2019-01-05 NaN NaT NaN
2019-01-06 NaN NaT NaN
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 NaN NaT NaN
2019-01-12 NaN NaT NaN
2019-01-13 NaN NaT NaN
This is what I would like to have:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 1.0 2019-01-02 1.0
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 2.0 2019-01-07 4.0
2019-01-05 2.0 2019-01-07 4.0
2019-01-06 2.0 2019-01-07 4.0
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 3.0 2019-01-13 3.0
2019-01-12 3.0 2019-01-13 3.0
2019-01-13 NaN NaT NaN
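Answer 0 (score: 1)
I think that to "forward fill the dataframe" you should use the pandas interpolate method. The documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
You can do something like this: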
int_how_many_consecutive_to_fill = 3
#ts is the reindexed frame from the question; df2 itself has no empty rows to fill
ts = ts.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
Looking at the documentation for interpolate, there is a lot of customization you can add to the method through its parameters.
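One thing worth checking before relying on a fixed limit: linear interpolation and forward fill treat a gap differently. The small sketch below uses a made-up Series (not the booking data) to show both with the same `limit`; `s`, `interpolated` and `filled` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Toy series with a gap of three missing values between 1.0 and 5.0.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Linear interpolation computes intermediate values,
# filling at most `limit` consecutive NaNs in the forward direction.
interpolated = s.interpolate(limit=2, limit_direction='forward')

# Forward fill repeats the last valid value,
# again for at most `limit` consecutive NaNs.
filled = s.ffill(limit=2)

print(interpolated.tolist())
print(filled.tolist())
```

For a column such as booking_id, repeating the last value is probably what is wanted, so `ffill(limit=...)` may be the better fit when the limit is the same for every booking.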
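Edit:
Doing this with the value from the duration column as the limit for each booking is a bit messy, but I think it should work (there may well be a less hacky, more concise solution using some functionality in pandas or another library that I'm not aware of):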
#ts is the reindexed frame from the question.
#get rows where every column is NaN (the days added by the reindex):
nans_df = ts[ts.isnull().all(axis=1)]
#get rows that hold an actual booking:
non_nans_df = ts[~ts.isnull().all(axis=1)]
one_day = pd.Timedelta(days=1)
#list of dfs we will concat vertically at the end to get final dataframe.
dfs = []
#iterate through each row that contains only NaNs.
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - one_day
    #only start a block at a NaN row that directly follows a booking row; any other NaN row is skipped. This also handles the case where the first row is a NaN one.
    if previous_day not in non_nans_df.index:
        continue
    date_offset = 0
    #count how many consecutive all-NaN rows there are starting at this one.
    while (nan_index + date_offset * one_day) in nans_df.index:
        date_offset += 1
    #last date in the run of consecutive all-NaN days that starts at nan_index.
    end_sequence_date = nan_index + (date_offset - 1) * one_day
    #block made of the booking row (previous_day) followed by the run of NaN rows after it. .loc slicing by date includes both ends.
    df_to_fill = ts.loc[previous_day:end_sequence_date]
    #pull the duration value from the first row of the block.
    limit_val = int(df_to_fill['duration'].iloc[0])
    #forward-fill at most limit_val rows. ffill is used rather than interpolate because it also repeats the check-out date column. Note that this fills through the check-out day.
    df_to_fill = df_to_fill.ffill(limit=limit_val)
    #append the filled block to our list that gets combined at the end.
    dfs.append(df_to_fill)
#gives us our final dataframe, filled forward using a dynamic limit value based on the most recent duration value.
final_df = pd.concat(dfs)
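As for a less hacky route, one possibility (an alternative sketch, not the method used above) is to skip the fill entirely and expand each booking into its own nights before reindexing. The names `pieces` and `days` are made up for this example, and it assumes a booking occupies every day from check-in up to, but not including, check-out.

```python
import pandas as pd

# Rebuild the sample data from the question.
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])

# Build one small frame per booking, with one row per occupied day.
pieces = []
for booking_id, check_in, check_out, duration in df.itertuples(index=False, name=None):
    # Assumption: the check-out day itself is not counted as booked.
    days = pd.date_range(check_in, check_out - pd.Timedelta(days=1), freq='D')
    pieces.append(pd.DataFrame(
        {'booking_id': booking_id, 'check-out': check_out, 'duration': duration},
        index=days,
    ))

# Stack the bookings and reindex on the full daily range; unbooked days become NaN/NaT.
idx = pd.date_range(df['check-in'].min(), df['check-out'].max(), freq='D')
ts = pd.concat(pieces).reindex(idx)
print(ts)
```

Because each booking is expanded on its own, the result does not depend on how long the gap before the next booking is.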
Answer 1 (score: 1)
n = df.Comment_vol.str.strip().isin(['Pass', ''])
m = df.Comment_wt.str.strip().isin(['Pass', ''])
df['Comment_final'] = np.select([n, ~n & m], [df.Comment_wt, df.Comment_vol], df.Comment_vol.str.cat(df.Comment_wt, sep=', '))
Out[591]:
Comment_vol Comment_wt Comment_final
0 Pass wtA wtA
1 Pass Pass
2 VolA Pass VolA
3 Pass Pass Pass
4 wtA wtA
5 VolA wtA VolA, wtA
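First, we create a series of forward-filled dates. Then we create a mask where the index is less than the filled value. Then we fill according to the mask.
If you want to include the row with the check-out date, change m from < to <=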
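The three steps described above can be sketched against the question's data roughly as follows. This is a reconstruction from the description, so the names `filled_checkout` and `result` are assumptions; only `m` is taken from the text.

```python
import pandas as pd

# Rebuild the sample data and the reindexed frame from the question.
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
idx = pd.date_range(df['check-in'].min(), df['check-out'].max(), freq='D')
ts = df.set_index('check-in').reindex(idx)

# Step 1: for every day, the check-out date of the most recent booking.
filled_checkout = ts['check-out'].ffill()

# Step 2: mask of the days that fall strictly before that check-out date.
# Use <= instead of < to include the check-out day itself.
m = ts.index.to_series() < filled_checkout

# Step 3: forward-fill every column, then keep only the rows selected by the mask.
result = ts.ffill().where(m)
print(result)
```

With `<`, a one-night booking such as booking 1 only marks its check-in day as booked.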