R: Working with shifted data rows

Time: 2018-05-05 19:19:31

Tags: r dataframe

- Example data to work with:

To create a simplified example, this is the output of dput(df):

df <- structure(list(SubjectID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L), .Label = c("1", "2", "3"), class = "factor"), EventNumber = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), 
    EventType = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
    ), .Label = c("A", "B"), class = "factor"), Param1 = c(0.3, 
    0.21, 0.87, 0.78, 0.9, 1.2, 1.4, 1.3, 0.6, 0.45, 0.45, 0.04, 
    0, 0.1, 0.03, 0.01, 0.09, 0.06, 0.08, 0.09, 0.03, 0.04, 0.04, 
    0.02), Param2 = c(45, 38, 76, 32, 67, 23, 27, 784, 623, 54, 
    54, 1056, 487, 341, 671, 859, 7769, 2219, 4277, 4060, 411, 
    440, 224, 57), Param3 = c(1.5, 1.7, 1.65, 1.32, 0.6, 0.3, 
    2.5, 0.4, 1.4, 0.67, 0.67, 0.32, 0.1, 0.15, 0.22, 0.29, 0.3, 
    0.2, 0.8, 1, 0.9, 0.8, 0.3, 0.1), Param4 = c(0.14, 0, 1, 
    0.86, 0, 0.6, 1, 1, 0.18, 0, 0, 0.39, 0, 1, 0.29, 0.07, 0.33, 
    0.53, 0.29, 0.23, 0.84, 0.61, 0.57, 0.59), Param5 = c(0.18, 
    0, 1, 0, 1, 0, 0.09, 1, 0.78, 0, 0, 1, 0.2, 0, 0.46, 0.72, 
    0.16, 0.22, 0.77, 0.52, 0.2, 0.68, 0.58, 0.17), Param6 = c(0, 
    1, 0.75, 0, 0.14, 0, 1, 0, 1, 0.27, 0, 1, 0, 0.23, 0.55, 
    0.86, 1, 0.33, 1, 1, 0.88, 0.75, 0, 0), AbsoluteTime = structure(c(1522533600, 
    1522533602, 1522533604, 1522533604, 1525125600, 1525125602, 
    1525125604, 1519254000, 1519254002, 1519254004, 1519254006, 
    1521759600, 1521759602, 1521759604, 1521759606, 1521759608, 
    1517353224, 1517353226, 1517353228, 1517353230, 1517439600, 
    1517439602, 1517439604, 1517439606), class = c("POSIXct", 
    "POSIXt"), tzone = "")), row.names = c(NA, -24L), class = "data.frame")
df

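The real data has 20 subjects, EventNumbers ranging from 1 to 100, and parameters from Param1 to Param40 (depending on the experiment). It has about 60,000 rows (observations).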

- What I want to achieve:

For df, create n * 40 new columns (40, or whatever number of parameters is chosen later).

Think of n as "steps into the future". Name the 40 * n newly created columns:

Param1_2, Param2_2, Param3_2, ..., Param39_2, Param40_2,

Param1_3, Param2_3, Param3_3, ..., Param39_3, Param40_3,

...

Param1_n, Param2_n, Param3_n, ..., Param39_n, Param40_n

resulting in the columns

Param1_1, Param2_1, Param1_2, Param2_2, Param1_3, Param2_3, Param1_4, Param2_4, ..., Param1_n, Param2_n

So every observation in the subset df[X, c(4:9)] gets an additional set of variables, with values taken from df[X+1, c(4:9)] through df[X+n, c(4:9)].
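Put differently, each new column is a copy of a parameter column moved up by k rows, without crossing group boundaries. A minimal base R sketch of that shift on made-up vectors; lead_by, x and g are hypothetical names used only for illustration:

```r
# Hypothetical helper: move a vector k positions "into the future".
# Indexing past the end of a vector yields NA, so the tail is NA-padded
# and the length stays the same.
lead_by <- function(x, k) x[seq_along(x) + k]

# Made-up values and a made-up grouping vector (two groups: s1 and s2)
x <- c(1, 2, 3, 10, 20)
g <- c("s1", "s1", "s1", "s2", "s2")

# Shift over the whole vector (ignores the groups)
shifted_all <- lead_by(x, 1)

# Shift within each group: ave() applies FUN per group and keeps row order
shifted_grouped <- ave(x, g, FUN = function(v) lead_by(v, 1))
```

The grouped variant is the one that matches the desired result, since the last row of each group has no "next" row and therefore gets NA.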

For n = 1, this is what the new df.extended would look like:

df.extended <- structure(list(SubjectID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3), EventNumber = c(1, 1, 
1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 
2), EventType = c("A", "A", "A", "A", "B", "B", "B", "A", "A", 
"A", "A", "B", "B", "B", "B", "B", "A", "A", "A", "A", "B", "B", 
"B", "B"), Param1 = c(0.3, 0.21, 0.87, 0.78, 0.9, 1.2, 1.4, 1.3, 
0.6, 0.45, 0.45, 0.04, 0, 0.1, 0.03, 0.01, 0.05, 0.07, 0.06, 
0.01, 0.01, 0.01, 0.07, 0.04), Param2 = c(45, 38, 76, 32, 67, 
23, 27, 784, 623, 54, 54, 1056, 487, 341, 671, 859, 1858, 640, 
8181, 220, 99, 86, 170, 495), Param3 = c(1.5, 1.7, 1.65, 1.32, 
0.6, 0.3, 2.5, 0.4, 1.4, 0.67, 0.67, 0.32, 0.1, 0.15, 0.22, 0.29, 
1.5, 0.9, 0.8, 0.9, 0.1, 0, 0.8, 0.1), Param4 = c(0.14, 0, 1, 
0.86, 0, 0.6, 1, 1, 0.18, 0, 0, 0.39, 0, 1, 0.29, 0.07, 0.64, 
0.11, 0.12, 0.32, 0.55, 0.67, 0.83, 0.82), Param5 = c(0.18, 0, 
1, 0, 1, 0, 0.09, 1, 0.78, 0, 0, 1, 0.2, 0, 0.46, 0.72, 0.27, 
0.14, 0.7, 0.67, 0.23, 0.44, 0.61, 0.76), Param6 = c(0, 1, 0.75, 
0, 0.14, 0, 1, 0, 1, 0.27, 0, 1, 0, 0.23, 0.55, 0.86, 1, 0.56, 
0.45, 0.5, 0, 0, 0.89, 0.11), AbsoluteTime = c("2018-04-01 00:00:00", 
"2018-04-01 00:00:02", "2018-04-01 00:00:04", "2018-04-01 00:00:04", 
"2018-05-01 00:00:00", "2018-05-01 00:00:02", "2018-05-01 00:00:04", 
"2018-02-22 00:00:00", "2018-02-22 00:00:02", "2018-02-22 00:00:04", 
"2018-02-22 00:00:06", "2018-03-23 00:00:00", "2018-03-23 00:00:02", 
"2018-03-23 00:00:04", "2018-03-23 00:00:06", "2018-03-23 00:00:08", 
"2018-01-31 00:00:24", "2018-01-31 00:00:26", "2018-01-31 00:00:28", 
"2018-01-31 00:00:30", "2018-02-01 00:00:00", "2018-02-01 00:00:02", 
"2018-02-01 00:00:04", "2018-02-01 00:00:06"), Param1_2 = c(0.21, 
0.87, 0.78, NA, 1.2, 1.4, NA, 0.6, 0.45, 0.45, NA, 0, 0.1, 0.03, 
0.01, NA, 0.07, 0.07, 0.08, NA, 0.09, 0.06, 0.01, NA), Param2_2 = c(38, 
76, 32, NA, 23, 27, NA, 623, 54, 54, NA, 487, 341, 671, 859, 
NA, 6941, 4467, 808, NA, 143, 301, 219, NA), Param3_2 = c(1.7, 
1.65, 1.32, NA, 0.3, 2.5, NA, 1.4, 0.67, 0.67, NA, 0.1, 0.15, 
0.22, 0.29, NA, 1, 1, 0.1, NA, 0.5, 1, 0.3, NA), Param4_2 = c(0, 
1, 0.86, NA, 0.6, 1, NA, 0.18, 0, 0, NA, 0, 1, 0.29, 0.07, NA, 
0.31, 0.16, 0.68, NA, 0.86, 0.47, 0.47, NA), Param5_2 = c(0, 
1, 0, NA, 0, 0.09, NA, 0.78, 0, 0, NA, 0.2, 0, 0.46, 0.72, NA, 
0.29, 0.26, 0.1, NA, 0.88, 0.86, 0.95, NA), Param6_2 = c(1, 0, 
0, NA, 0, 1, NA, 1, 0.27, 0, NA, 0, 0.23, 0.55, 0.86, NA, 0.68, 
0.66, 0, NA, 0.44, 1, 0.22, NA)), row.names = c(NA, 24L), class = "data.frame")
df.extended

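How can I solve this without using loops, writing column indices by hand, and so on? Write a function for Trial 2 and use doBy?

My thoughts and what I have done so far to solve this: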

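  1. Trial 1:

    1. Loop through the SubjectIDs in a for loop
    2. In an inner for loop, loop through the EventNumbers
    3. In another inner for loop, loop through the rows
    4. Grab df[1, ] to get the first row and save it to df.temp
    5. Merge df.temp with df[2, parameters]
    6. Merge df.temp with df[3, parameters], and so on
    7. Save all resulting df.temp to df.final
    8. Problems I ran into, at step 5: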

      df.temp <- df[1,]
      df.temp <- merge(df.temp, df[2, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
      df.temp <- merge(df.temp, df[3, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
      df.temp <- merge(df.temp, df[4, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
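      Warning:
      In merge.data.frame(df.temp, df[4, ], by = c("SubjectID", "EventNumber",  :
        column names ‘Param1.x’, ‘Param2.x’, ‘Param3.x’, ‘Param4.x’, ‘Param5.x’, ‘Param6.x’, ‘AbsoluteTime.x’, ‘Param1.y’, ‘Param2.y’, ‘Param3.y’, ‘Param4.y’, ‘Param5.y’, ‘Param6.y’, ‘AbsoluteTime.y’ are duplicated in the result

      • Duplicated column names, see the warning.
      • I can't figure out how to easily create column names / rename the new columns based on the given column names and a variable.

      There has to be a better way than this: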

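      n <- 3
      parameters <- names(df)[4:9]   # assumed definition (Param1 ... Param6); not shown in the original post
      names_vector <- c()
      for (step in seq_len(n)) {      # separate loop variable so n is not overwritten
        for (i in parameters) {
          names_vector <- c(names_vector, paste0(i, "_", step + 1))
        }
      }
      names(df.temp)[c(4:9)] <- parameters
      names(df.temp)[c(11:ncol(df.temp))] <- names_vector
      names(df.temp)
      
      • Also, how do I keep the last n-1 rows from breaking the script? This is a lot of manual work, and I think it is very error-prone!?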
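  2. Trial 2:

    1. Loop through the SubjectIDs in a for loop
    2. In an inner for loop, loop through the EventNumbers
    3. Put all parameter rows except the first into a new data frame
    4. Append a row of NAs
    5. Combine the rows with cbind()
    6. Repeat n times.
    7. This is the code for one SubjectID and one EventNumber: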

        parameters <- names(df)[4:9]   # assumed definition (Param1 ... Param6); not shown in the original post
        df.temp <- df[which(df$SubjectID == "1" & df$EventNumber == "1"), ]
        # 2:nrow(df.temp)-1 parses as (2:nrow(df.temp)) - 1, i.e. rows 1..(nrow - 1),
        # so the "-1" is dropped to really skip the first row(s)
        df.temp2 <- df.temp[2:nrow(df.temp), parameters]
        df.temp2 <- rbind(df.temp2, NA)
        df.temp <- cbind(df.temp, df.temp2)
        df.temp2 <- df.temp[3:nrow(df.temp), parameters]
        df.temp2 <- rbind(df.temp2, NA, NA)
        df.temp <- cbind(df.temp, df.temp2)
        df.temp2 <- df.temp[4:nrow(df.temp), parameters]
        df.temp2 <- rbind(df.temp2, NA, NA, NA)
        df.temp <- cbind(df.temp, df.temp2)
        n <- 3
        names_vector <- c()
        for (step in seq_len(n)) {      # separate loop variable so n is not overwritten
          for (i in parameters) {
            names_vector <- c(names_vector, paste0(i, "_", step + 1))
          }
        }
        names(df.temp)[c(4:9)] <- parameters
        names(df.temp)[c(11:ncol(df.temp))] <- names_vector
        df.temp
        
        • This solves the problem of the missing rows (NAs are acceptable in my case).
        • It is still a lot of work by hand / with loops, and error-prone!?
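The Trial 2 steps above (drop the first k rows of a group, pad with NAs, bind the result as new columns, name them ParamX_(k+1)) can be wrapped into one function. The following is only a minimal base R sketch under stated assumptions: add_leads and its arguments are hypothetical names, the toy data frame is made up, and the column suffix follows the k + 1 convention from the question.

```r
# Hypothetical helper: for each step k = 1..n, add one NA-padded,
# group-wise shifted copy of every parameter column.
add_leads <- function(data, params, groups, n) {
  # One factor that identifies each group (e.g. SubjectID x EventNumber)
  grp <- interaction(data[groups], drop = TRUE)

  new_cols <- lapply(seq_len(n), function(k) {
    # Shift every parameter column by k rows within each group;
    # indexing past the end of a group yields NA
    shifted <- lapply(data[params], function(col) {
      ave(col, grp, FUN = function(v) v[seq_along(v) + k])
    })
    # Name the columns Param1_2, Param2_2, ... for k = 1, and so on
    setNames(as.data.frame(shifted), paste0(params, "_", k + 1))
  })

  # Bind the original data and all shifted blocks side by side
  do.call(cbind, c(list(data), new_cols))
}

# Made-up toy data: three groups with 2, 1 and 2 rows
toy <- data.frame(
  SubjectID   = c(1, 1, 1, 2, 2),
  EventNumber = c(1, 1, 2, 1, 1),
  Param1      = c(0.3, 0.21, 0.9, 1.3, 0.6),
  Param2      = c(45, 38, 67, 784, 623)
)

out <- add_leads(toy, params = c("Param1", "Param2"),
                 groups = c("SubjectID", "EventNumber"), n = 2)
```

On the example data from the question, a call along the lines of add_leads(df, names(df)[4:9], c("SubjectID", "EventNumber"), n = 1) would be the analogous usage. The function assumes the rows of each group are already in the desired order.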

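2 Answers:

Answer 0: (score: 1)

Something like this:

You can use the package dplyr to add and rename variables according to the various subsets of interest in your data. dplyr also provides the functions lead() and lag(), which can be used to find the "next" or "previous" values in a vector (or, here, in rows). You can combine lead() with mutate_at() to extract values from the succeeding nth row and use them to create a new set of variables.

Here I use the data you provided in your example: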

# load dplyr package
library(dplyr)

# create new data frame "df.extended"
df.extended <- df

# number of observations per group (e.g., SubjectID)
# or desired number of successions
obs <- 3

# loop until number of successions achieved
for (i in 1:obs) {

  # overwrite df.extended with new information
  df.extended <- df.extended %>%
    # group by subjects and events
    group_by(SubjectID, EventNumber) %>%
    # create new variable for each parameter
    # (note: funs() is deprecated as of dplyr 0.8.0)
    mutate_at(vars(Param1:Param6),
              # using the lead function
              .funs = funs(step = lead),
              # for the nth following row
              n = i) %>%
    # rename the new variables to show the succession number
    rename_at(vars(contains("_step")), funs(sub("step", as.character(i), .)))

}

This should roughly recreate the data you posted as the desired result.

# Look at first part of "df.extended"
> head(df.extended)

# A tibble: 6 x 28
# Groups:   SubjectID, EventNumber [2]
  SubjectID EventNumber EventType Param1 Param2 Param3 Param4 Param5 Param6 AbsoluteTime        Param1_1 Param2_1 Param3_1 Param4_1 Param5_1 Param6_1
  <fct>     <fct>       <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dttm>                 <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 1         1           A          0.300    45.  1.50   0.140  0.180  0.    2018-04-01 00:00:00    0.210      38.    1.70     0.      0.        1.00 
2 1         1           A          0.210    38.  1.70   0.     0.     1.00  2018-04-01 00:00:02    0.870      76.    1.65     1.00    1.00      0.750
3 1         1           A          0.870    76.  1.65   1.00   1.00   0.750 2018-04-01 00:00:04    0.780      32.    1.32     0.860   0.        0.   
4 1         1           A          0.780    32.  1.32   0.860  0.     0.    2018-04-01 00:00:04   NA          NA    NA       NA      NA        NA    
5 1         2           B          0.900    67.  0.600  0.     1.00   0.140 2018-05-01 00:00:00    1.20       23.    0.300    0.600   0.        0.   
6 1         2           B          1.20     23.  0.300  0.600  0.     0.    2018-05-01 00:00:02    1.40       27.    2.50     1.00    0.0900    1.00 
# ... with 12 more variables: Param1_2 <dbl>, Param2_2 <dbl>, Param3_2 <dbl>, Param4_2 <dbl>, Param5_2 <dbl>, Param6_2 <dbl>, Param1_3 <dbl>,
#   Param2_3 <dbl>, Param3_3 <dbl>, Param4_3 <dbl>, Param5_3 <dbl>, Param6_3 <dbl>
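The same idea can also be written with across(), which was introduced in dplyr 1.0.0 and supersedes mutate_at() and funs(). A minimal sketch on a small made-up data frame; toy, params and steps are illustrative names and are not part of the answer above, and the suffix follows that answer's convention (Param1_1 is one row ahead):

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across()

# Made-up toy data: three groups with 2, 1 and 2 rows
toy <- data.frame(
  SubjectID   = factor(c(1, 1, 1, 2, 2)),
  EventNumber = factor(c(1, 1, 2, 1, 1)),
  Param1      = c(0.3, 0.21, 0.9, 1.3, 0.6),
  Param2      = c(45, 38, 67, 784, 623)
)

params <- c("Param1", "Param2")  # columns to shift
steps  <- 2                      # number of rows to look ahead

toy.extended <- toy
for (k in seq_len(steps)) {
  toy.extended <- toy.extended %>%
    group_by(SubjectID, EventNumber) %>%
    # lead() each parameter by k rows within its group and
    # name the result <column>_<k>, e.g. Param1_1
    mutate(across(all_of(params),
                  ~ lead(.x, n = k),
                  .names = paste0("{.col}_", k))) %>%
    ungroup()
}
```

Selecting the columns with all_of(params) keeps each pass from picking up the columns created by the previous pass, so no rename step is needed.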

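Answer 1: (score: 1)

For base R, consider by to slice the data by SubjectID, EventNumber, and EventType, and then run a merge using a helper column, grp_num. To run this across a series of steps, wrap the by process in lapply to get a list of data frames that you chain-merge outside of it, followed by a final merge with the original data frame: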

df_list <- lapply(2:3, function(i) {
  # BUILD LIST OF DATAFRAMES
  by_list <- by(df, df[c("SubjectID", "EventNumber", "EventType")], FUN=function(sub){

    sub$grp_num <- 1:nrow(sub)
    row_less_sub <- transform(sub, AbsoluteTime=NULL, grp_num=grp_num-(i-1))

    merge(sub, row_less_sub, by=c("SubjectID", "EventNumber", "EventType", "grp_num"), 
          all.x=TRUE, suffixes = c("", paste0("_", i)))
  })

  # APPEND ALL DATAFRAMES IN LIST
  grp_df <- do.call(rbind, by_list)
  grp_df <- with(grp_df, grp_df[order(SubjectID, EventNumber),])
  # KEEP NEEDED COLUMNS ([0-9]+ SO THAT Param10 ... Param40 ALSO MATCH)
  grp_df <- grp_df[c("SubjectID", "EventNumber", "EventType", "grp_num",
                   names(grp_df)[grep("Param[0-9]+_", names(grp_df))])]
  row.names(grp_df) <- NULL

  return(grp_df)
})

# ALL PARAMS_* CHAIN MERGE
params_df <- Reduce(function(x,y) merge(x, y, by=c("SubjectID", "EventNumber", "EventType", "grp_num")), df_list)

# ORIGINAL DF AND PARAMS MERGE
df$grp_num <- ave(df$Param1, df$SubjectID, df$EventNumber, df$EventType, 
                  FUN=function(x) cumsum(rep(1, length(x))))

final_df <- transform(merge(df, params_df, by=c("SubjectID", "EventNumber", "EventType", "grp_num")), grp_num=NULL)
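The core of the approach above is a self-merge on an offset row counter. It can be seen in isolation on a toy data frame. This is only an illustrative sketch: toy, id and val are made-up names, and a single grouping column stands in for SubjectID, EventNumber and EventType.

```r
# Made-up toy data with two groups
toy <- data.frame(id  = c("a", "a", "a", "b", "b"),
                  val = c(10, 20, 30, 40, 50))

# Row counter within each group: 1, 2, 3 for "a" and 1, 2 for "b"
toy$grp_num <- ave(seq_along(toy$id), toy$id, FUN = seq_along)

# Copy of the data whose counter is moved down by one, so that
# row k + 1 of a group lines up with row k of the original
shifted <- transform(toy, grp_num = grp_num - 1L)

# Left join on group and counter: each row picks up the values of the
# next row in its group; the last row of a group has no match and gets NA
res <- merge(toy, shifted, by = c("id", "grp_num"),
             all.x = TRUE, suffixes = c("", "_2"))
res <- res[order(res$id, res$grp_num), ]
```

Here res$val_2 holds the value of the following row within the same id, which is the role the Param*_2 columns play in the answer's code.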