Consider data containing employer-employee links, each with a start and end date.
employer employee start end
0 0 0 2007-01-01 2007-12-31
1 1 86 2007-01-01 2007-12-31
2 1 63 2007-06-01 2007-12-31
3 1 93 2007-01-01 2007-12-31
Now I want to "spread" the dates, i.e. create one observation for every month between start and end. I thought that
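import pandas as pd

def extend(x):
    # One month-end timestamp per month between start and end
    index = pd.date_range(start=x['start'], end=x['end'], freq='M')
    # Repeat the row's values once for every date in the index
    df = pd.DataFrame([x.values], index=index, columns=x.index)
    return df

long = df.apply(extend, axis=1)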
would do the trick, but the result contains only the index:
>>> long.head()
Out[245]:
employer employee start end
0 employer employee start end
1 employer employee start end
However, when I test it on the first row, it works:
>>> extend(df.iloc[0])
Out[246]:
employer employee start end
2007-01-31 0 0 2007-01-01 2007-12-31
2007-02-28 0 0 2007-01-01 2007-12-31
2007-03-31 0 0 2007-01-01 2007-12-31
(...)
What am I doing wrong? Or is there perhaps a better way? My ultimate goal is output like the previous one, but in the format employer employee month year.
Answer 0 (score: 0)
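I think the problem is that apply expects the function to return the same number of rows as the input.
You can do it with iterrows and a list comprehension, without changing your code much: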
def extend(x):
    index = pd.date_range(start=x['start'], end=x['end'], freq='M')
    df = pd.DataFrame([x.values], index=index, columns=x.index)
    return df
>>> new = pd.concat([extend(x) for _,x in df.iterrows()])
>>> new
employer employee start end
2007-01-31 0 0 2007-01-01 2007-12-31
2007-02-28 0 0 2007-01-01 2007-12-31
2007-03-31 0 0 2007-01-01 2007-12-31
2007-04-30 0 0 2007-01-01 2007-12-31
2007-05-31 0 0 2007-01-01 2007-12-31
2007-06-30 0 0 2007-01-01 2007-12-31
2007-07-31 0 0 2007-01-01 2007-12-31
2007-08-31 0 0 2007-01-01 2007-12-31
2007-09-30 0 0 2007-01-01 2007-12-31
2007-10-31 0 0 2007-01-01 2007-12-31
2007-11-30 0 0 2007-01-01 2007-12-31
2007-12-31 0 0 2007-01-01 2007-12-31
2007-01-31 1 86 2007-01-01 2007-12-31
2007-02-28 1 86 2007-01-01 2007-12-31
2007-03-31 1 86 2007-01-01 2007-12-31
2007-04-30 1 86 2007-01-01 2007-12-31
2007-05-31 1 86 2007-01-01 2007-12-31
2007-06-30 1 86 2007-01-01 2007-12-31
2007-07-31 1 86 2007-01-01 2007-12-31
2007-08-31 1 86 2007-01-01 2007-12-31
2007-09-30 1 86 2007-01-01 2007-12-31
2007-10-31 1 86 2007-01-01 2007-12-31
2007-11-30 1 86 2007-01-01 2007-12-31
2007-12-31 1 86 2007-01-01 2007-12-31
2007-06-30 1 63 2007-06-01 2007-12-31
2007-07-31 1 63 2007-06-01 2007-12-31
2007-08-31 1 63 2007-06-01 2007-12-31
2007-09-30 1 63 2007-06-01 2007-12-31
2007-10-31 1 63 2007-06-01 2007-12-31
2007-11-30 1 63 2007-06-01 2007-12-31
2007-12-31 1 63 2007-06-01 2007-12-31
2007-01-31 1 93 2007-01-01 2007-12-31
2007-02-28 1 93 2007-01-01 2007-12-31
2007-03-31 1 93 2007-01-01 2007-12-31
2007-04-30 1 93 2007-01-01 2007-12-31
2007-05-31 1 93 2007-01-01 2007-12-31
2007-06-30 1 93 2007-01-01 2007-12-31
2007-07-31 1 93 2007-01-01 2007-12-31
2007-08-31 1 93 2007-01-01 2007-12-31
2007-09-30 1 93 2007-01-01 2007-12-31
2007-10-31 1 93 2007-01-01 2007-12-31
2007-11-30 1 93 2007-01-01 2007-12-31
2007-12-31 1 93 2007-01-01 2007-12-31
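A variant of the same idea that avoids building one DataFrame per row is sketched below. It assumes a pandas version that provides DataFrame.explode (0.25 or later) and that start and end are already datetime columns; the sample frame is rebuilt from the question's data, and the names months and exploded are only illustrative. It stores monthly periods rather than month-end timestamps, which is a swapped-in choice, not what the code above does.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "employer": [0, 1, 1, 1],
    "employee": [0, 86, 63, 93],
    "start": pd.to_datetime(["2007-01-01", "2007-01-01", "2007-06-01", "2007-01-01"]),
    "end": pd.to_datetime(["2007-12-31"] * 4),
})

# One list of monthly periods per link, aligned with df's index
months = pd.Series(
    [list(pd.period_range(s, e, freq="M")) for s, e in zip(df["start"], df["end"])],
    index=df.index,
)

# explode turns each list element into its own row and repeats the other columns
exploded = df.assign(month=months).explode("month")
```

The original row label is repeated in the index of exploded, so each monthly row can still be traced back to its link.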
You could also use groupby/apply, since it is more flexible. Something like this:
def extend(x):
    # Each (employer, employee) group has a single row here; take it
    x = x.iloc[0,:]
    dates = pd.date_range(start=x['start'], end=x['end'], freq='M')
    return pd.DataFrame(dates,columns=['date'])
>>> long = df.groupby(['employer','employee'])[['start','end']].apply(extend)
>>> long
date
employer employee
0 0 0 2007-01-31
1 2007-02-28
2 2007-03-31
3 2007-04-30
4 2007-05-31
5 2007-06-30
6 2007-07-31
7 2007-08-31
8 2007-09-30
9 2007-10-31
10 2007-11-30
11 2007-12-31
1 63 0 2007-06-30
1 2007-07-31
2 2007-08-31
3 2007-09-30
4 2007-10-31
5 2007-11-30
6 2007-12-31
86 0 2007-01-31
1 2007-02-28
2 2007-03-31
3 2007-04-30
4 2007-05-31
5 2007-06-30
6 2007-07-31
7 2007-08-31
8 2007-09-30
9 2007-10-31
10 2007-11-30
11 2007-12-31
93 0 2007-01-31
1 2007-02-28
2 2007-03-31
3 2007-04-30
4 2007-05-31
5 2007-06-30
6 2007-07-31
7 2007-08-31
8 2007-09-30
9 2007-10-31
10 2007-11-30
11 2007-12-31
Alternatively, you can iterate over the rows and concat the results.
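For the layout the question ultimately asks for (employer employee month year), one possible sketch is to build the rows directly with a nested comprehension. It assumes start and end are datetime columns; the sample frame is rebuilt from the question's data and the name monthly is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "employer": [0, 1, 1, 1],
    "employee": [0, 86, 63, 93],
    "start": pd.to_datetime(["2007-01-01", "2007-01-01", "2007-06-01", "2007-01-01"]),
    "end": pd.to_datetime(["2007-12-31"] * 4),
})

# One tuple per (link, month): iterate over the rows, then over the monthly periods
rows = [
    (r.employer, r.employee, p.month, p.year)
    for r in df.itertuples(index=False)
    for p in pd.period_range(r.start, r.end, freq="M")
]

monthly = pd.DataFrame(rows, columns=["employer", "employee", "month", "year"])
```

With the sample data this gives one row per link and month, so the link starting in June contributes only the months June through December.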