Question

Consider data containing employer-employee links with start and end dates.

   employer  employee      start        end
0         0         0 2007-01-01 2007-12-31
1         1        86 2007-01-01 2007-12-31
2         1        63 2007-06-01 2007-12-31
3         1        93 2007-01-01 2007-12-31

Now I want to "spread" the dates, i.e. create one observation for every month between start and end. I thought that

def extend(x):
    index = pd.date_range(start=x['start'], end=x['end'], freq='M')
    df = pd.DataFrame([x.values], index=index, columns=x.index)
    return df

long = df.apply(extend, axis=1)

would do this, but the result contains only the column names:

>>> long.head()
Out[245]: 
   employer  employee  start  end
0  employer  employee  start  end
1  employer  employee  start  end

However, when I test it on the first row, it works:

>>> extend(df.iloc[0])
Out[246]: 
            employer  employee      start        end
2007-01-31         0         0 2007-01-01 2007-12-31
2007-02-28         0         0 2007-01-01 2007-12-31
2007-03-31         0         0 2007-01-01 2007-12-31
(...)

What am I doing wrong? Or is there perhaps a better way? My ultimate goal is output like the previous one, but in the format employer employee month year.

Answer 1

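I think the problem is that apply expects the function to return one result per input row, whereas extend returns a DataFrame with several rows for each input row.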

You can do it with iterrows and a list comprehension, without changing your code much:

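def extend(x):
    # One month-end timestamp per month between start and end
    # (pandas 2.2+ spells this alias 'ME'; 'M' is deprecated there)
    index = pd.date_range(start=x['start'], end=x['end'], freq='M')
    # Repeat the row's values explicitly, once per month in the index
    df = pd.DataFrame([x.values] * len(index), index=index, columns=x.index)
    return df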

>>> new = pd.concat([extend(x) for _,x in df.iterrows()])
>>> new

            employer  employee      start        end
2007-01-31         0         0 2007-01-01 2007-12-31
2007-02-28         0         0 2007-01-01 2007-12-31
2007-03-31         0         0 2007-01-01 2007-12-31
2007-04-30         0         0 2007-01-01 2007-12-31
2007-05-31         0         0 2007-01-01 2007-12-31
2007-06-30         0         0 2007-01-01 2007-12-31
2007-07-31         0         0 2007-01-01 2007-12-31
2007-08-31         0         0 2007-01-01 2007-12-31
2007-09-30         0         0 2007-01-01 2007-12-31
2007-10-31         0         0 2007-01-01 2007-12-31
2007-11-30         0         0 2007-01-01 2007-12-31
2007-12-31         0         0 2007-01-01 2007-12-31
2007-01-31         1        86 2007-01-01 2007-12-31
2007-02-28         1        86 2007-01-01 2007-12-31
2007-03-31         1        86 2007-01-01 2007-12-31
2007-04-30         1        86 2007-01-01 2007-12-31
2007-05-31         1        86 2007-01-01 2007-12-31
2007-06-30         1        86 2007-01-01 2007-12-31
2007-07-31         1        86 2007-01-01 2007-12-31
2007-08-31         1        86 2007-01-01 2007-12-31
2007-09-30         1        86 2007-01-01 2007-12-31
2007-10-31         1        86 2007-01-01 2007-12-31
2007-11-30         1        86 2007-01-01 2007-12-31
2007-12-31         1        86 2007-01-01 2007-12-31
2007-06-30         1        63 2007-06-01 2007-12-31
2007-07-31         1        63 2007-06-01 2007-12-31
2007-08-31         1        63 2007-06-01 2007-12-31
2007-09-30         1        63 2007-06-01 2007-12-31
2007-10-31         1        63 2007-06-01 2007-12-31
2007-11-30         1        63 2007-06-01 2007-12-31
2007-12-31         1        63 2007-06-01 2007-12-31
2007-01-31         1        93 2007-01-01 2007-12-31
2007-02-28         1        93 2007-01-01 2007-12-31
2007-03-31         1        93 2007-01-01 2007-12-31
2007-04-30         1        93 2007-01-01 2007-12-31
2007-05-31         1        93 2007-01-01 2007-12-31
2007-06-30         1        93 2007-01-01 2007-12-31
2007-07-31         1        93 2007-01-01 2007-12-31
2007-08-31         1        93 2007-01-01 2007-12-31
2007-09-30         1        93 2007-01-01 2007-12-31
2007-10-31         1        93 2007-01-01 2007-12-31
2007-11-30         1        93 2007-01-01 2007-12-31
2007-12-31         1        93 2007-01-01 2007-12-31
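To get closer to the employer employee month year layout asked for in the question, the same iterrows/concat idea can build those columns directly. This is only a sketch: the sample frame is retyped from the question, and passing pd.offsets.MonthEnd() instead of the 'M' alias is my own substitution, so that it does not depend on which alias the installed pandas accepts.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "employer": [0, 1, 1, 1],
    "employee": [0, 86, 63, 93],
    "start": pd.to_datetime(["2007-01-01", "2007-01-01", "2007-06-01", "2007-01-01"]),
    "end": pd.to_datetime(["2007-12-31"] * 4),
})

def extend(row):
    # One month-end timestamp per month between start and end
    # (assumption: an explicit MonthEnd offset in place of freq='M')
    dates = pd.date_range(start=row["start"], end=row["end"],
                          freq=pd.offsets.MonthEnd())
    # The scalar employer/employee values are broadcast to len(dates)
    return pd.DataFrame({
        "employer": row["employer"],
        "employee": row["employee"],
        "month": dates.month,
        "year": dates.year,
    })

# One small frame per input row, stacked into a single long frame
long = pd.concat([extend(row) for _, row in df.iterrows()], ignore_index=True)
```

Here long has one row per employer/employee/month, with the dates reduced to plain month and year columns.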

You can also use groupby/apply, which is more flexible. For example:

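def extend(x):
    # Each (employer, employee) group has a single row; take it as a Series
    x = x.iloc[0,:]
    # One month-end timestamp per month between start and end
    dates = pd.date_range(start=x['start'], end=x['end'], freq='M')
    return pd.DataFrame(dates,columns=['date'])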

>>> long = df.groupby(['employer','employee'])[['start','end']].apply(extend)
>>> long

                           date
employer employee
0        0        0  2007-01-31
                  1  2007-02-28
                  2  2007-03-31
                  3  2007-04-30
                  4  2007-05-31
                  5  2007-06-30
                  6  2007-07-31
                  7  2007-08-31
                  8  2007-09-30
                  9  2007-10-31
                  10 2007-11-30
                  11 2007-12-31
1        63       0  2007-06-30
                  1  2007-07-31
                  2  2007-08-31
                  3  2007-09-30
                  4  2007-10-31
                  5  2007-11-30
                  6  2007-12-31
         86       0  2007-01-31
                  1  2007-02-28
                  2  2007-03-31
                  3  2007-04-30
                  4  2007-05-31
                  5  2007-06-30
                  6  2007-07-31
                  7  2007-08-31
                  8  2007-09-30
                  9  2007-10-31
                  10 2007-11-30
                  11 2007-12-31
         93       0  2007-01-31
                  1  2007-02-28
                  2  2007-03-31
                  3  2007-04-30
                  4  2007-05-31
                  5  2007-06-30
                  6  2007-07-31
                  7  2007-08-31
                  8  2007-09-30
                  9  2007-10-31
                  10 2007-11-30
                  11 2007-12-31
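The groupby result above still carries employer and employee in the index, with a single date column. A possible way to flatten it into the employer employee month year format from the question is sketched below. The sample frame is retyped from the question, the name tidy is just illustrative, and pd.offsets.MonthEnd() is again my substitution for the 'M' alias.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "employer": [0, 1, 1, 1],
    "employee": [0, 86, 63, 93],
    "start": pd.to_datetime(["2007-01-01", "2007-01-01", "2007-06-01", "2007-01-01"]),
    "end": pd.to_datetime(["2007-12-31"] * 4),
})

def extend(x):
    # Each (employer, employee) group has a single row here
    x = x.iloc[0, :]
    dates = pd.date_range(start=x["start"], end=x["end"],
                          freq=pd.offsets.MonthEnd())
    return pd.DataFrame(dates, columns=["date"])

long = df.groupby(["employer", "employee"])[["start", "end"]].apply(extend)

# Move the group keys out of the index and drop the per-group counter
flat = long.reset_index(level=["employer", "employee"]).reset_index(drop=True)

# Derive month and year from the month-end dates
flat["month"] = flat["date"].dt.month
flat["year"] = flat["date"].dt.year
tidy = flat[["employer", "employee", "month", "year"]]
```

Note that groupby sorts the keys, so the rows come out ordered by employer and employee rather than in the original row order.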

Alternatively, you can iterate over the rows and concat the results.
