Question

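I have a data.frame that currently holds one record per row, but I want to convert it so that each row holds three records (to give a machine learning algorithm more trend data).

As an example, my data.frame currently looks like this (though there are more variables besides Rank and Speed):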

Date  | Participant | Ctry | Rank | Speed
----- |-------------|------|------|-------
17/01 | 1           | AU   | 1    | 0.9   
18/01 | 1           | AU   | 4    | 0.6   
19/01 | 1           | AU   | 2    | 0.7   
20/01 | 1           | AU   | 1    | 0.4   
17/01 | 2           | ZA   | 5    | 0.3   
18/01 | 2           | ZA   | 3    | 0.5   
19/01 | 2           | ZA   | 4    | 0.6

I would like to transform it into this (rolling windows of 3 for each participant):

StartDate  | Participant | Ctry | Rank_1 | Rank_2 | Rank_3 | Speed_1 | Speed_2 | Speed_3
---------- | ----------- | ---- | ------ | ------ | ------ | ------- | ------- | -------
17/01      | 1           | AU   | 1      | 4      | 2      | 0.9     | 0.6     | 0.7
18/01      | 1           | AU   | 4      | 2      | 1      | 0.6     | 0.7     | 0.4
17/01      | 2           | ZA   | 5      | 3      | 4      | 0.3     | 0.5     | 0.6

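I could use nested for loops to create this data structure, but I am sure there is a more efficient way. I have looked into reshape(2) and dplyr functions, but could not find anything that works for rolling multi-variable windows.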

Answer 1

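The OP has asked to reshape the data from long format to a special wide format in which each row finally contains three records. For example, participant 1 will have one row containing the values of 17/01, 18/01, and 19/01, and a second row containing the values of 18/01, 19/01, and 20/01.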
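As a minimal sketch of how such windows can be formed for a single participant, shift() accepts a vector of offsets and returns one shifted copy per offset. The vector `dates` below is a hypothetical stand-in for participant 1's Date column and is not part of the original answer.

```r
library(data.table)

# hypothetical stand-in for the Date column of participant 1
dates <- c("17/01", "18/01", "19/01", "20/01")
n_recs <- 3L

# one lead-shifted copy per offset 0, 1, 2; unnamed list columns become V1, V2, V3
win <- as.data.table(shift(dates, seq_len(n_recs) - 1L, type = "lead"))

# windows that run past the end of the data contain NA and are dropped here
complete <- na.omit(win)
print(complete)
```

Each remaining row is one window: V1 is the start date, and V2 and V3 are the two dates that follow it.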

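Note that this operation adds redundant data, as some values may appear up to three times after reshaping. Also note that the OP has requested to reshape multiple value variables simultaneously. This feature has been added to recent versions of the data.table package.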
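To illustrate reshaping several value variables at once, here is a small sketch with dcast(). The table `long` and its columns `id` and `pos` are made up for this illustration and do not come from the OP's data.

```r
library(data.table)

# hypothetical long-format data: two positions per id, two value variables
long <- data.table(
  id    = c(1L, 1L, 2L, 2L),
  pos   = c(1L, 2L, 1L, 2L),
  Rank  = c(1L, 4L, 5L, 3L),
  Speed = c(0.9, 0.6, 0.3, 0.5)
)

# passing several names in value.var creates one set of columns per variable,
# named <variable>_<pos>
wide <- dcast(long, id ~ pos, value.var = c("Rank", "Speed"))
print(wide)
```

The resulting column names (Rank_1, Rank_2, Speed_1, Speed_2) follow the same pattern as the target layout in the question.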

Below is a solution that uses shift(), melt(), dcast(), and a join from the data.table package:

library(data.table)
# define number of records per row
n_recs <- 3L
# create sequences of dates to be included per row using shift() with multiple offsets,
# keep only complete sequences, add StartDate column for later dcast()
windows <- na.omit(DT[, shift(Date, seq_len(n_recs) - 1L, type = "lead"), by = Participant])[
  , StartDate := V1]
# reshape to long form for later join, 
# rename variables for automatic creation of column names in dcast()
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date")[
    , variable := stringi::stri_replace(variable, fixed = "V", "")]
# right join with original data to create additional rows,
# reshape from long to wide form using multiple value vars,
# reorder for convenience 
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]

   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6

Data

library(data.table)
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9
  18/01 | 1           | AU   | 4    | 0.6
  19/01 | 1           | AU   | 2    | 0.7
  20/01 | 1           | AU   | 1    | 0.4
  17/01 | 2           | ZA   | 5    | 0.3
  18/01 | 2           | ZA   | 3    | 0.5
  19/01 | 2           | ZA   | 4    | 0.6
  ",
  sep = "|"
)

Edit

I have realised that the code above relies on the implicit assumption that each participant has at least as many records as are to be combined into one row. The OP's sample data contain 4 rows for participant 1 and 3 rows for participant 2, so this condition is met.

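However, if a participant has only one or two rows, na.omit() will remove that participant completely from the final result. Perhaps this is what the OP wants. If not, the code needs to be modified as follows: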

# create new sample data including cases with less than 3 records per participant
DT <- fread(
  "Date  | Participant | Ctry | Rank | Speed
  17/01 | 1           | AU   | 1    | 0.9   
  18/01 | 1           | AU   | 4    | 0.6   
  19/01 | 1           | AU   | 2    | 0.7   
  20/01 | 1           | AU   | 1    | 0.4   
  17/01 | 2           | ZA   | 5    | 0.3   
  18/01 | 2           | ZA   | 3    | 0.5   
  19/01 | 2           | ZA   | 4    | 0.6   
  17/01 | 3           | DE   | 2    | 0.8,
  17/01 | 4           | DK   | 3    | 0.8,
  18/01 | 4           | DK   | 4    | 0.8",
  sep = "|"
) 

# modified code
n_recs <- 3L
min_rows <- 1L
windows <- DT[, lapply(shift(Date, seq_len(n_recs) - 1L, type = "lead"), 
                       head, n = pmax(.N - n_recs + 1L, min_rows)), 
              by = Participant][, StartDate := V1]
lwin <- melt(windows, id.vars = c("Participant", "StartDate"), value.name = "Date", 
             na.rm = TRUE)[
  , variable := stringi::stri_replace(variable, fixed = "V", "")]
dcast(
  DT[lwin, on = .(Participant, Date)], 
  StartDate + Participant + Ctry ~ variable, value.var = c("Rank", "Speed"))[
    order(Participant, StartDate)]

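   StartDate Participant Ctry Rank_1 Rank_2 Rank_3 Speed_1 Speed_2 Speed_3
1:     17/01           1   AU      1      4      2     0.9     0.6     0.7
2:     18/01           1   AU      4      2      1     0.6     0.7     0.4
3:     17/01           2   ZA      5      3      4     0.3     0.5     0.6
4:     17/01           3   DE      2     NA     NA    0.8,      NA      NA
5:     17/01           4   DK      3      4     NA    0.8,     0.8      NA

Note the "incomplete" rows 4 and 5, which are due to missing input data for participants 3 and 4. However, all participants are guaranteed to appear in the final result.

This is achieved by using head() to explicitly limit the number of rows created for each participant when computing windows. In addition, melt() now has to be called with the parameter na.rm = TRUE.

If min_rows is set to 0L, the incomplete rows 4 and 5 will disappear from the final result.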
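The effect of min_rows can be checked with a short worked calculation. This is only a sketch: `n_obs` is a hypothetical vector holding the number of records for participants 1 to 4 in the modified sample data, and the expression mirrors the `n` argument passed to head() above.

```r
n_recs <- 3L

# hypothetical helper vector: records per participant 1..4 in the modified sample data
n_obs <- c(4L, 3L, 1L, 2L)

# rows kept per participant when at least one row is forced (min_rows = 1L)
keep_min1 <- pmax(n_obs - n_recs + 1L, 1L)

# rows kept per participant when short participants may vanish (min_rows = 0L)
keep_min0 <- pmax(n_obs - n_recs + 1L, 0L)

print(keep_min1)
print(keep_min0)
```

With min_rows = 1L every participant keeps at least one (possibly incomplete) window, whereas with min_rows = 0L participants 3 and 4 get zero rows and so drop out.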
