data.frame manipulation and combinations to generate lists

Time: 2016-12-04 00:26:27

Tags: r list sorting dataframe

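The title isn't the best, but this is a complicated task to explain in one sentence. I am working on dynamically generating three lists of data frames, where the elements at the same position in each list correspond to each other. For example, L1[[1]], L2[[1]], and L3[[1]] hold data that relate to each other, and so on. I have been working on this for around 35 hours and cannot figure out a streamlined way to make it work.

There are two main sources of data, which I am calling "sort" and "ref", for sorting and reference.

This is a very simplified version of the reference data. The actual data comes from multiple .csv files. This data can contain n rows with static column names, where ID is the ID of the row and each other column is a variable. A trailing R in a column name means "right"; no R means "left":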

R1 <- c(1,200,201,20,21,300,301,30,31,400,401,40,41)
R2 <- c(2,201,202,21,22,301,302,31,32,401,402,41,42)
R3 <- c(3,200,201,20,21,300,301,30,31,NA,NA,NA,NA)
R4 <- c(4,201,202,21,22,NA,NA,NA,NA,401,402,41,42)
ref <- data.frame(rbind(R1,R2,R3,R4))
colnames(ref) <- c("ID","H1","H1R","H2","H2R","U1","U1R","U2","U2R","R1","R1R","R2","R2R")

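The sort data is similar but formatted differently. It has the corresponding column names, but without the trailing R; the side is instead listed separately in its own Side column: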

R1 <- c(1,"left","H",1,200,20,NA,NA,NA,NA)
R2 <- c(2,"right","H",2,201,21,NA,NA,NA,NA)
R3 <- c(3,"left","R",2,NA,NA,NA,NA,400,40)
R4 <- c(4,"right","R",3,NA,NA,NA,NA,401,41)
R5 <- c(5,"left","U",2,NA,NA,300,30,NA,NA)
R6 <- c(6,"right","U",5,NA,NA,301,31,NA,NA)
sort <- data.frame(rbind(R1,R2,R3,R4,R5,R6))
colnames(sort) <- c("ID","Side","Element","Individual","H1","H2","U1","U2","R1","R2")
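One thing worth checking in data built this way: because each row is created with c() from a mix of numbers and strings, every value is coerced to character, so all columns of the resulting data frame are text. The sketch below uses only two of the rows and shows one possible way to recover numeric columns with type.convert(); the name `sort_typed` is made up for illustration, and whether the real .csv data needs this is an assumption.

```r
# Two rows built the same way as in the question
R1 <- c(1, "left", "H", 1, 200, 20, NA, NA, NA, NA)
R2 <- c(2, "right", "H", 2, 201, 21, NA, NA, NA, NA)
sort <- data.frame(rbind(R1, R2), stringsAsFactors = FALSE)
colnames(sort) <- c("ID", "Side", "Element", "Individual",
                    "H1", "H2", "U1", "U2", "R1", "R2")

# c() coerced each mixed vector to character, so every column is character
sapply(sort, class)

# type.convert() re-guesses each column's type;
# as.is = TRUE keeps text columns as character instead of factor
sort_typed <- type.convert(sort, as.is = TRUE)
sapply(sort_typed, class)
```

Comparisons such as `sort$Side == "left"` behave the same either way, but arithmetic on the measurement columns only works after a conversion like this.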

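The code I am trying to develop will use two additional objects holding element names. The first can contain any number of elements, while the second always contains exactly one. These correspond to the "Element" variable in the sort and reference data. For example: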

B1 <- c("H","U")
B2 <- "R"

The first task is to split the sort data by B1 and B2, using the following code:

sort1 <- array()
for (i in B1) {
    sort1 <- rbind(sort1, sort[sort$Element == i,])
}
sort1 <- sort1[-1,] #removes first row from binding NA row
sort2 <- sort[sort$Element == B2,]
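As a possible simplification of the split above, `%in%` can test each row against all of B1 in one step, which avoids growing a data frame inside a loop and removing the placeholder row afterwards. This is a minimal sketch on a cut-down copy of the sort data (only the four key columns are reproduced here).

```r
# Cut-down copy of the question's sort data (values kept as character)
sort <- data.frame(
  ID         = c("1", "2", "3", "4", "5", "6"),
  Side       = c("left", "right", "left", "right", "left", "right"),
  Element    = c("H", "H", "R", "R", "U", "U"),
  Individual = c("1", "2", "2", "3", "2", "5"),
  stringsAsFactors = FALSE
)
B1 <- c("H", "U")
B2 <- "R"

# %in% compares every row's Element against the whole B1 vector at once
sort1 <- sort[sort$Element %in% B1, ]
sort2 <- sort[sort$Element == B2, ]
```

One difference from the loop: rows keep the order they have in the data rather than being grouped in the order of B1. For this example the two orders happen to coincide.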

Now we have sort1 and sort2. The next task is to group sort1 by identical Individual numbers, using the following code:

a1 <- list(data.frame())
j <- 1
for (i in unique(sort1$Individual)) {
    a1[[j]] <- sort1[sort1$Individual == i,]
    j <- j + 1
}
sort1 <- a1
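The grouping loop above does what base R's split() does in one call. A minimal sketch, assuming sort1 contains the four H and U rows (only the key columns are reproduced); the name `groups` is used here purely for illustration.

```r
# The H and U rows from the question, key columns only
sort1 <- data.frame(
  ID         = c("1", "2", "5", "6"),
  Side       = c("left", "right", "left", "right"),
  Element    = c("H", "H", "U", "U"),
  Individual = c("1", "2", "2", "5"),
  stringsAsFactors = FALSE
)

# split() returns one data frame per distinct Individual value
groups <- split(sort1, sort1$Individual)

# Optional: drop the names so the list is indexed by position, like the loop's result
groups <- unname(groups)
```

Note that split() orders the groups by the sorted values of Individual, whereas the loop over unique() uses order of first appearance. The two agree for this example but need not in general.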

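Now, at this stage, we have a list sort1 that contains data frames of rows sharing the same Individual number; sort2, whose elements are single rows; and ref, which contains the reference values for all variables. Next I need to create the unique combinations between sort1 and sort2, where each data frame in sort1 is combined with each row of sort2 as long as the side and element of the sort2 row are not in the sort1 data frame, using the following code: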

a1 <- list(data.frame())
a2 <- list(data.frame())
x <- 1  
for(i in 1:length(sort1)) {
    for(j in 1:nrow(sort2)) {
        if(sort1[[i]]$Element != sort2[j,]$Element || sort1[[i]]$Side != sort2[j,]$Side) {
            a1[[x]] <- sort1[[i]][,colSums(is.na(sort1[[i]])) < nrow(sort1[[i]])] #removes NAs
            a2[[x]] <- sort2[j,][,colSums(is.na(sort2[j,])) < nrow(sort2[j,])] #removes NAs
        }
        x <- x + 1
    }
}
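The nested loops above could also be written by enumerating the (i, j) index pairs first and then building both lists from them, so the two lists stay aligned by construction. This is only a sketch: `drop_all_na`, `idx`, and `keep` are names introduced here, and the filter encodes one reading of the pairing rule (a pair is dropped when some row of the sort1 group has the same Side and Element as the sort2 row), which is an assumption.

```r
sort <- data.frame(
  ID         = as.character(1:6),
  Side       = c("left", "right", "left", "right", "left", "right"),
  Element    = c("H", "H", "R", "R", "U", "U"),
  Individual = c("1", "2", "2", "3", "2", "5"),
  H1 = c("200", "201", NA, NA, NA, NA),
  H2 = c("20", "21", NA, NA, NA, NA),
  U1 = c(NA, NA, NA, NA, "300", "301"),
  U2 = c(NA, NA, NA, NA, "30", "31"),
  R1 = c(NA, NA, "400", "401", NA, NA),
  R2 = c(NA, NA, "40", "41", NA, NA),
  stringsAsFactors = FALSE
)
b1_rows <- sort$Element %in% c("H", "U")
sort1 <- unname(split(sort[b1_rows, ], sort$Individual[b1_rows]))
sort2 <- sort[sort$Element == "R", ]

# Hypothetical helper: drop columns that are NA in every row
drop_all_na <- function(df) df[, colSums(is.na(df)) < nrow(df), drop = FALSE]

# One row per (i, j) pair; j varies fastest, matching the nested loops
idx <- expand.grid(j = seq_len(nrow(sort2)), i = seq_along(sort1))

# Assumed pairing rule: reject a pair if any sort1 row matches the
# sort2 row on both Side and Element
keep <- mapply(function(i, j) {
  !any(sort1[[i]]$Element == sort2$Element[j] &
       sort1[[i]]$Side == sort2$Side[j])
}, idx$i, idx$j)
idx <- idx[keep, ]

a1 <- lapply(idx$i, function(i) drop_all_na(sort1[[i]]))
a2 <- lapply(idx$j, function(j) drop_all_na(sort2[j, ]))
```

Because both lists are built from the same rows of `idx`, a rejected pair leaves no gap in either list.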

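Now a1 and a2 are both lists containing the corresponding combinations, where each index of one list corresponds to the same index of the other. So a1[[1]] corresponds to a2[[1]], and so on.

All of this code works. What I need to do now is create a list of reference data, similar to how I split the sort data. The reference criteria are as follows. The reference ID must be the same across all variables of each combination (i.e., each a1[[1]], a2[[1]]), which means there cannot be any NAs in the reference data for each unique combination.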

For example, for a1[[4]] and a2[[4]], I need to extract the reference data that corresponds to the available sides and variables.

a1[[4]]:
ID  Side   Element Individual   H1   H2   U1   U2
2   right     H        2        201   21 <NA> <NA>
5   left      U        2       <NA> <NA>  300  30

a2[[4]]:
ID  Side    Element Individual  R1 R2
4   right       R       3      401 41

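I need a3[[4]] to look like the following, with no NAs and with data available for every variable for each ID. At that point, all three lists of data.frames hold corresponding data: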

ID    H1R   H2R  U1    U2    R1R   R2R  
1     201,  21,  300,  30,   401,  41
2     202,  22,  301,  31,   402,  42

The following code is my attempt at this, but it is currently broken and horribly designed:

zz <- 1
REF1 <- list(data.frame())
myfun <- function(x,y) {
    names1 <- colnames(x[5:ncol(x)])
    names2 <- colnames(y[5:ncol(y)])
    names <- c(names1, names2)

    IND1 <- data.frame()
    IND2 <- data.frame()
    for(n in names1) {
        for(i in nrow(x)) {
            if(x[i,]$Side == "left") {
                if(!is.na(x[i,][[n]])) {
                    name <- n
                    IND1[zz] <- ref[[x[i,]$Element]][[name]]
                }
            }
            if(x[i,]$Side == "right") {
                if(!is.na(x[i,][[n]])) {
                    name <- paste(n, "R", sep="")
                    IND2[zz] <- ref[[x[i,]$Element]][[name]]
                }
            }
        }
    }
    DEP1 <- data.frame()
    DEP2 <- data.frame()
    for(n in names2) {
        if(y$Side == "left") {
            if(!is.na(y[[n]])) {
                name <- n
                DEP1[zz] <- ref[[y$Element]][[name]]
            }
        }
        if(y$Side == "right") {
            if(!is.na(y[[n]])) {
                name <- paste(n, "R", sep="")
                DEP2[zz] <- ref[[y$Element]][[name]]
            }
        }
    }#names

    REF1[[zz]] <- cbind(IND1, IND2, DEP1, DEP2)
    zz <- zz + 1
    return(REF1)
}#myfun
output1 <- mapply(myfun, x = a1, y = a2)
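For comparison, here is a minimal sketch of the lookup described in the question for the a1[[4]] / a2[[4]] example: collect the variables that are actually measured in each row, append "R" to the name when the row's Side is "right", select those columns from ref, and keep only the ref rows with no NA. The helper names `key_cols`, `ref_names`, and `ref_for` are invented for this sketch, and it assumes ref column names follow the variable-plus-optional-"R" convention shown earlier. No variable name is hard-coded.

```r
ref <- data.frame(
  ID = 1:4,
  H1 = c(200, 201, 200, 201), H1R = c(201, 202, 201, 202),
  H2 = c(20, 21, 20, 21),     H2R = c(21, 22, 21, 22),
  U1 = c(300, 301, 300, NA),  U1R = c(301, 302, 301, NA),
  U2 = c(30, 31, 30, NA),     U2R = c(31, 32, 31, NA),
  R1 = c(400, 401, NA, 401),  R1R = c(401, 402, NA, 402),
  R2 = c(40, 41, NA, 41),     R2R = c(41, 42, NA, 42)
)
a1_4 <- data.frame(
  ID = c("2", "5"), Side = c("right", "left"),
  Element = c("H", "U"), Individual = c("2", "2"),
  H1 = c("201", NA), H2 = c("21", NA),
  U1 = c(NA, "300"), U2 = c(NA, "30"),
  stringsAsFactors = FALSE
)
a2_4 <- data.frame(
  ID = "4", Side = "right", Element = "R", Individual = "3",
  R1 = "401", R2 = "41", stringsAsFactors = FALSE
)

# Columns that describe a row rather than hold a measurement
key_cols <- c("ID", "Side", "Element", "Individual")

# Hypothetical helper: ref column names needed for every row of df
ref_names <- function(df) {
  vars <- setdiff(names(df), key_cols)
  unlist(lapply(seq_len(nrow(df)), function(i) {
    used <- vars[!is.na(unlist(df[i, vars]))]   # variables measured in row i
    if (df$Side[i] == "right") paste0(used, "R") else used
  }))
}

# Hypothetical helper: ref rows that are complete for one combination
ref_for <- function(x, y) {
  cols <- c(ref_names(x), ref_names(y))
  out <- ref[, c("ID", cols), drop = FALSE]
  out[complete.cases(out), , drop = FALSE]
}

a3_4 <- ref_for(a1_4, a2_4)
a3_4
```

For this example the result keeps ref IDs 1 and 2 with columns H1R, H2R, U1, U2, R1R, and R2R, which matches the a3[[4]] shown in the question. Applied to the full lists, something like `Map(ref_for, a1, a2)` would return one data frame per combination.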

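Any help is greatly appreciated. I tried to simplify the problem. If I need to clarify anything, please let me know! The end goal is to use the three lists of data frames in various statistical tests, where each index of each list represents a single combination, including the reference data to be used.

EDITED: Even though the column names for the variables are static, I cannot specify them directly in the code, because any of them may or may not be in use (except for ID, Side, Element, and Individual). In reality, I have up to 185 different variables in the full dataset.

EDITED: Below are the desired combinations of the sort data. They are not in any particular data frame; this is only a visual representation of the combinations.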

[[1]]
   ID Side Element Individual  H1 H2                  ID Side Element Individual  R1 R2
    1 left       H          1 200 20                   3 left       R          2 400 40


[[2]]
   ID Side Element Individual  H1 H2                  ID  Side Element Individual  R1 R2
    1 left       H          1 200 20                   4 right       R          3 401 41


[[3]]
   ID  Side Element Individual   H1   H2   U1   U2    ID Side Element Individual  R1 R2
    2 right       H          2  201   21 <NA> <NA>     3 left       R          2 400 40
    5  left       U          2 <NA> <NA>  300   30     


[[4]]
   ID  Side Element Individual   H1   H2   U1   U2    ID  Side Element Individual  R1 R2
    2 right       H          2  201   21 <NA> <NA>     4 right       R          3 401 41
    5  left       U          2 <NA> <NA>  300   30  


[[5]]
   ID  Side Element Individual  U1 U2                 ID Side Element Individual  R1 R2
   6 right       U          5 301 31                   3 left       R          2 400 40


[[6]]
   ID  Side Element Individual  U1 U2                 ID  Side Element Individual  R1 R2
   6 right       U          5 301 31                   4 right       R          3 401 41

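1 Answer:

Answer 0: (score: 0)

Note: this is not a solution, but something I think you can use to help guide or clarify what we are trying to achieve here, since there may be a simpler approach. I have been on 35-hour binges trying to jam a square peg into a round hole myself, so I get it. So far, the code does the following:

  1. Splits the sort dataset into data.frames based on B1 and B2
  2. Collects them in a named list (sort1, sort2)
  3. Splits the sort1 data frame in that list again, by Individual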

    library(dplyr)  # provides %>% and filter()

    f <- function(x_tbl){
     x_tbl %>% (function(x){
     # find string occurrences of "H|U" in the Element column and filter
     a <- x %>% filter(grepl(paste0(B1, collapse = "|"), Element))
     # filter sort table where the element is "R"
     b <- x[x$Element == B2,]
     # combine into list
     new_split <- list(a,b)
     # naming just so I can follow along here
     names(new_split) <- c('sort1','sort2')
     # Splitting the sort1 table again, now by the Individual column
     new_split[['sort1']] <- split(
     new_split[['sort1']], new_split[['sort1']][['Individual']])
     # Now have the list of variable tables and key tables...
     # not sure what you're trying to say after this point
      return(new_split)
     })
    }
    
  4. Running it on the sort table gets me to here... now help me out

    > f(sort)
    $sort1
    $sort1$`1`
      ID Side Element Individual  H1 H2   U1   U2   R1   R2
    1  1 left       H          1 200 20 <NA> <NA> <NA> <NA>
    
    $sort1$`2`
      ID  Side Element Individual   H1   H2   U1   U2   R1   R2
    2  2 right       H          2  201   21 <NA> <NA> <NA> <NA>
    3  5  left       U          2 <NA> <NA>  300   30 <NA> <NA>
    
    $sort1$`5`
      ID  Side Element Individual   H1   H2  U1 U2   R1   R2
    4  6 right       U          5 <NA> <NA> 301 31 <NA> <NA>
    
    
    $sort2
       ID  Side Element Individual   H1   H2   U1   U2  R1 R2
    R3  3  left       R          2 <NA> <NA> <NA> <NA> 400 40
    R4  4 right       R          3 <NA> <NA> <NA> <NA> 401 41
    

    EDIT ATTEMPT #1: I also put the code in this gist:

    https://gist.github.com/CarlBoneri/edd9ad9c89fdbf81a5ad87532228a8b0

    library(plyr)   # provides ldply(), used below; loaded before dplyr to limit masking
    library(dplyr)
    library(jsonlite)
    #' Given the data frame `sort`,
    #' and provided variables `B1` and `B2`, whereby `B1` represents an array
    #' of `Elements` to be matched and compared against from table `sort` given
    #' the outlying variable input of `B2` to find all unique pair-values of
    #' column vectors `H1:R2`
    #'
    #'
    #' ## TARGET ELEMENTS
    #' 1) Split the sort table into grouped tables, each returned item in the list
    #'    representing a unique `Element` variable from input `B1`
    #'
    # Setup by filtering matches of input `B1`
    target_chunk <- sort[grepl(paste0(B1,collapse="|"),sort$Element),]
    target_chunk
    ===  =====  =======  ==========  ===  ===  ===  ===  ===  ===
    ID   Side   Element  Individual  H1   H2   U1   U2   R1   R2 
    ===  =====  =======  ==========  ===  ===  ===  ===  ===  ===
    1    left   H        1           200  20   NA   NA   NA   NA 
    2    right  H        2           201  21   NA   NA   NA   NA 
    5    left   U        2           NA   NA   300  30   NA   NA 
    6    right  U        5           NA   NA   301  31   NA   NA 
    ===  =====  =======  ==========  ===  ===  ===  ===  ===  ===
    
    
    # Split on the individual
    target_list_ind <- split(target_chunk, target_chunk$Individual)
    target_list_ind
    $`1`
    ID Side Element Individual  H1 H2   U1   U2   R1   R2
    R1  1 left       H          1 200 20 <NA> <NA> <NA> <NA>
    
    $`2`
       ID  Side Element Individual   H1   H2   U1   U2   R1   R2
    R2  2 right       H          2  201   21 <NA> <NA> <NA> <NA>
    R5  5  left       U          2 <NA> <NA>  300   30 <NA> <NA>
    
    $`5`
       ID  Side Element Individual   H1   H2  U1 U2   R1   R2
    R6  6 right       U          5 <NA> <NA> 301 31 <NA> <NA>
    
    # Split each iteration of the `target_list_ind` on the Element
    target_list_elm <- sapply(target_list_ind, function(i){
      if(nrow(i)>1){
        split(i, i[['Element']])
      }else{
        i
      }
    })
    target_list_elm
    
    > target_list_elm
    $`1`
       ID Side Element Individual  H1 H2   U1   U2   R1   R2
    R1  1 left       H          1 200 20 <NA> <NA> <NA> <NA>
    
    $`2`
    $`2`$H
       ID  Side Element Individual  H1 H2   U1   U2   R1   R2
    R2  2 right       H          2 201 21 <NA> <NA> <NA> <NA>
    
    $`2`$U
       ID Side Element Individual   H1   H2  U1 U2   R1   R2
    R5  5 left       U          2 <NA> <NA> 300 30 <NA> <NA>
    
    
    $`5`
       ID  Side Element Individual   H1   H2  U1 U2   R1   R2
    R6  6 right       U          5 <NA> <NA> 301 31 <NA> <NA>
    
    #' 2) Set up our inner key table that represents the `B2` variable
    source_tbl <- sort[grepl(paste0(B2),sort$Element),]
    
    #' 3) The big loop here.. not sure if the result is what you are referencing
    #'
    element_l_df <- lapply(target_list_elm, function(i){
      if(is.data.frame(i)){
        f_src <- source_tbl[i[['Side']] != source_tbl$Side,]
        f_src <- f_src[i[['Element']] != f_src$Element,]
        source_el <- f_src[['Element']]
        target_el <- i
        source_vals <- f_src %>% select(-c(ID, Individual,Element,Side))
        target_vals <- i%>%select(-c(ID, Individual,Element,Side))
        var_bound <- cbind(source_vals[!mapply(is.na, source_vals)],
                           target_vals[!mapply(is.na, target_vals)])
        data.frame(individual = target_el[['Individual']],
                   source_element = source_el,
                   target_element = target_el[['Element']],
                   as.data.frame(var_bound))
    
    
      }else{
        ldply(1:length(i), function(x){
          f_src <- source_tbl[i[[x]][['Side']] != source_tbl$Side,]
          f_src <- f_src[i[[x]][['Element']] != f_src$Element,]
          source_el <- f_src[['Element']]
          target_el <- i[[x]]
          target_vals <- i[[x]]%>%select(-c(ID, Individual,Element,Side))
          source_vals <- f_src %>% select(-c(ID, Individual,Element,Side))
          var_bound <- data.frame(source_vals[!mapply(is.na, source_vals)],
                                  target_vals[!mapply(is.na, target_vals)])
          data.frame(individual = target_el[['Individual']],
                     source_element = source_el,
                     target_element = target_el[['Element']],
                     as.data.frame(var_bound))
    
    
        })
      }
    })
    
    element_l_df
    

    Possible result:

    #' rbind.pages to put into 1 data frame
    #'
    rbind.pages(element_l_df)
    
    ==========  ==============  ==============  ===  ===  ===  ===  ===  ===
    individual  source_element  target_element  R1   R2   H1   H2   U1   U2 
    ==========  ==============  ==============  ===  ===  ===  ===  ===  ===
    1           R               H               401  41   200  20   NA   NA 
    2           R               H               400  40   201  21   NA   NA 
    2           R               U               401  41   NA   NA   300  30 
    5           R               U               400  40   NA   NA   301  31 
    ==========  ==============  ==============  ===  ===  ===  ===  ===  ===
    

    Edit 2

    library(reshape2)  # assumption: melt() is reshape2's; the answer does not name the package
    A <- melt(rbind.pages(element_l_df), c("source_element","target_element", "individual"))
    > head(A)
      source_element target_element individual variable value
    1              R              H          1       R1   401
    2              R              H          2       R1   400
    3              R              U          2       R1   401
    4              R              U          5       R1   400
    5              R              H          1       R2    41
    6              R              H          2       R2    40
    

    Is this the result you are after?

    Edit from comments

    Okay, this was the model; I'll circle back on this tomorrow.