Question

我有一个如下所示的df：

President   Start Date  End Date
B Clinton   1992-01-01  1999-12-31
G Bush      2000-01-01  2007-12-31
B Obama     2008-01-01  2015-12-31
D Trump     2016-01-01  2019-12-31 # not too far away!!

我想创建另一个df，类似这样

timestamp   President
1992-01-01  B Clinton
1992-01-02  B Clinton
...
2000-01-01  G Bush
...

基本上我想创建一个数据框，其索引为时间戳，然后根据另一个df的两列上的条件选择其内容。

我觉得大熊猫内部有一种方法可以做到这一点，但我不确定如何做到。我尝试使用np.piecewise，但似乎对我而言很难产生条件。我该怎么办？

Answer 1

这是另一个unnesting问题

df['New']=[pd.date_range(x,y).tolist() for x , y in zip (df.StartDate,df.EndDate)]

unnesting(df,['New'])

仅供参考，我已在此处粘贴函数

def unnesting(df, explode):
    idx=df.index.repeat(df[explode[0]].str.len())
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
    df1.index=idx
    return df1.join(df.drop(explode,1),how='left')

Answer 2

您可以使用pd.date_range从起始值和结束值创建日期范围。确保开始日期和结束日期为日期时间格式。

s = df.set_index('President').apply(lambda x: pd.Series(pd.date_range(x['Start Date'], x['End Date'])), axis = 1).stack().reset_index(1, drop = True)

new_df = pd.DataFrame(s.index.values, index=s, columns = ['President'] )



            President
1992-01-01  B Clinton
1992-01-02  B Clinton
1992-01-03  B Clinton
1992-01-04  B Clinton
1992-01-05  B Clinton
1992-01-06  B Clinton
1992-01-07  B Clinton
1992-01-08  B Clinton
1992-01-09  B Clinton

Answer 3

也许您可以使用PeriodIndex而不是DatetimeIndex，因为您要处理的是规则间隔的时间间隔，即年。

# create a list of PeriodIndex objects with annual frequency
p_idxs = [pd.period_range(start, end, freq='A') for idx, (start, end) in df[['Start Date', 'End Date']].iterrows()]

# for each PeriodIndex create a DataFrame where 
# the number of president instances matches the length of the PeriodIndex object
df_list = []
for pres, p_idx in zip(df['President'].tolist(), p_idxs):
    df_ = pd.DataFrame(data=len(p_idx)*[pres], index=p_idx)
    df_list.append(df_)

# concatenate everything to get the desired output
df_desired = pd.concat(df_list, axis=0)

如何在大熊猫中进行复杂选择？

3 个答案: