Question

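I have output from another program that I want to analyze in R. However, because of the way it is written, the output is too hard to analyze as is, so I want to convert it into a data.frame. I explain below.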

I have many vectors with the following pattern:

[["tree:" 3 "bromeliad:" 326 "local:" "canopy" 698 221 "height:" 5 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 208 208 267 267 268 268 233 233] ["tree:" 3 "bromeliad:" 538 "local:" "canopy" 748 187 "height:" 8 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 214 214 267 267 268 268 233 233] ["tree:" 3 "bromeliad:" 481 "local:" "canopy" 670 194 "height:" 8 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 208 208 267 267 268 268 233 233] ["tree:" 4 "bromeliad:" 412 "local:" "canopy" 701 206 "height:" 6 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 208 208 267 267 268 268 233 233] ["tree:" 4 "bromeliad:" 843 "local:" "canopy" 742 197 "height:" 6 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 208 214 267 267 268 268 233 233] ["tree:" 5 "bromeliad:" 473 "local:" "canopy" 714 169 "height:" 7 "origin:" "seed" "ancestry:" [34] "alleles:" 167 167 169 169 208 208 267 267 268 268 233 233]]

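To explain the output above: one "[" encloses the whole data set, and another "[" encloses each row. Each column name sits between double quotes and ends with ":". The names are repeated in every row. There are both character and numeric variables. The last 12 values are the genotype: two alleles for each of 6 microsatellites (so each pair can be merged into a single cell or kept separate).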
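To make the layout described above concrete, here is a minimal base R sketch that splits the text into records and then into tokens. It assumes the whole output is available as a single character string (called `input` here, a name chosen only for illustration) and uses the first two records of the sample.

```r
# Two records copied from the sample above, joined into one string
input <- paste0(
  '[["tree:" 3 "bromeliad:" 326 "local:" "canopy" 698 221 "height:" 5 ',
  '"origin:" "seed" "ancestry:" [34] "alleles:" ',
  '167 167 169 169 208 208 267 267 268 268 233 233] ',
  '["tree:" 3 "bromeliad:" 538 "local:" "canopy" 748 187 "height:" 8 ',
  '"origin:" "seed" "ancestry:" [34] "alleles:" ',
  '167 167 169 169 214 214 267 267 268 268 233 233]]'
)

# Records are separated by the literal text "] ["
records <- strsplit(input, "] [", fixed = TRUE)[[1]]

# Drop brackets and double quotes, then split each record on spaces
tokens <- lapply(records, function(rec) {
  cleaned <- gsub("\\[|\\]|\"", "", rec)
  strsplit(trimws(cleaned), " +")[[1]]
})

# Every record yields 27 tokens: labels at odd-looking fixed positions
# (1, 3, 5, 9, 11, 13, 15) and values in between; alleles are tokens 16 to 27
lengths(tokens)
```

Because the labels always appear in the same order, each value can be picked out by its position, which is what both answers below rely on.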

From this example, I am trying to get a data.frame like the following:

tree  bromeliad  local   x    y    height  origin  ancestry  locus.1  locus.2  locus.3  locus.4  locus.5  locus.6
3     326        canopy  698  221  5       seed    34        167167   169169   208208   267267   268268   233233
3     538        canopy  748  187  8       seed    34        167167   169169   214214   267267   268268   233233
3     481        canopy  670  194  8       seed    34        167167   169169   208208   267267   268268   233233
4     412        canopy  701  206  6       seed    34        167167   169169   208208   267267   268268   233233
4     843        canopy  742  197  6       seed    34        167167   169169   208214   267267   268268   233233
5     473        canopy  714  169  7       seed    34        167167   169169   208208   267267   268268   233233

I think someone who knows programming better than I do can handle this challenge better. Can you help me? =)

Answer 1

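I think this will get you what you want. The genotypes are stored as a vector in a list column. Admittedly, this is not very robust: it will fail if the columns are not exactly as described and in that order, or if the contents include any symbols (including decimal points). (If you have decimals, I would need to adjust the regular expression a little.)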

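library(tidyverse)  # loads stringr, purrr, tibble and the %>% pipe

# `input` is one character string holding the whole bracketed output
output <- str_split(input, "\\] \\[") %>%   # one string per record
  unlist(recursive = FALSE) %>%
  str_extract_all("\\w+") %>%               # word tokens only: drops brackets, quotes, colons
  map_dfr(.f = function(vec) {
    # Labels sit at positions 1, 3, 5, 9, 11, 13, 15; values are picked by position
    tibble(
      tree = vec[2] %>% as.integer(),
      bromeliad = vec[4] %>% as.integer(),
      local = vec[6],
      x = vec[7] %>% as.integer(),
      y = vec[8] %>% as.integer(),
      height = vec[10] %>% as.integer(),
      origin = vec[12],
      ancestry = vec[14] %>% as.integer(),
      locus = list(locus = vec[16:27] %>% as.integer())  # list column with the 12 alleles
    )
  })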
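The two limitations mentioned above (decimals, and alleles kept as one vector rather than one column per locus) could be handled along these lines. This is only a sketch in base R: the decimal height `5.5` is a made-up value for illustration, and the `locus.1` ... `locus.6` names follow the table in the question.

```r
# One record in the question's format; the height 5.5 is invented to show decimals
record <- paste0(
  '["tree:" 3 "bromeliad:" 326 "local:" "canopy" 698 221 "height:" 5.5 ',
  '"origin:" "seed" "ancestry:" [34] "alleles:" ',
  '167 167 169 169 208 208 267 267 268 268 233 233]'
)

# "\\w+" would break 5.5 into "5" and "5"; allowing "." inside a token keeps it whole
vec <- regmatches(record, gregexpr("[[:alnum:]_.]+", record))[[1]]

# Token positions are unchanged, so height is still the 10th token
height <- as.numeric(vec[10])

# Pair the 12 allele tokens (positions 16 to 27) into 6 locus strings
alleles <- vec[16:27]
loci <- paste0(alleles[c(TRUE, FALSE)], alleles[c(FALSE, TRUE)])
names(loci) <- paste0("locus.", 1:6)
loci
```

The logical indices `c(TRUE, FALSE)` and `c(FALSE, TRUE)` are recycled, so they select the first and second allele of each pair respectively.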

Answer 2

I tried doing it Melissa's way, but I don't know why it did not work. Based on it, though, I changed my code as follows:

library(stringr)  # str_split()

pop_list <- list()
for (i in seq_along(genots_list)) {
  input <- genots_list[[i]]
  # One string per record
  input2 <- str_split(input, "\\] \\[")
  input3 <- unlist(input2, recursive = FALSE)
  df <- data.frame(matrix(nrow = 0, ncol = 14))
  for (j in seq_along(input3)) {
    # Split the record on spaces; values are then picked by position
    input4 <- unlist(strsplit(input3[[j]], " "))
    df <- rbind(df, data.frame(
      tree = as.integer(input4[2]),
      bromeliad = as.integer(input4[4]),
      local = gsub("\"", "", input4[6]),   # strip any literal quotes
      x = as.integer(input4[7]),
      y = as.integer(input4[8]),
      height = as.integer(input4[10]),
      origin = gsub("\"", "", input4[12]),
      ancestry = as.integer(gsub("\\[|\\]", "", input4[14])),
      # Paste the two alleles of each locus into one number, e.g. 167 167 -> 167167
      locus_1 = as.integer(paste0(input4[16], input4[17])),
      locus_2 = as.integer(paste0(input4[18], input4[19])),
      locus_3 = as.integer(paste0(input4[20], input4[21])),
      locus_4 = as.integer(paste0(input4[22], input4[23])),
      locus_5 = as.integer(paste0(input4[24], input4[25])),
      # The last token may carry the closing brackets
      locus_6 = as.integer(paste0(input4[26], gsub("\\[|\\]", "", input4[27])))
    ))
  }
  pop_list[[i]] <- df
}

Here `i` indexes the members of the list given at the beginning.
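Growing `df` with `rbind` inside a loop gets slow when there are many rows. As a possible alternative, the same positional parsing can be written with `lapply` and a single `do.call(rbind, ...)` per population. This is a sketch, not the original author's code: `parse_record` and `parse_population` are hypothetical helper names, `genots_list` here is a one-element stand-in built from the sample, and the loci are kept as character strings rather than integers.

```r
# Parse one record (a string) into a one-row data.frame
parse_record <- function(rec) {
  tok <- strsplit(trimws(gsub("\\[|\\]|\"", "", rec)), " +")[[1]]
  alleles <- tok[16:27]
  loci <- paste0(alleles[c(TRUE, FALSE)], alleles[c(FALSE, TRUE)])
  fields <- c(
    list(
      tree = as.integer(tok[2]), bromeliad = as.integer(tok[4]),
      local = tok[6], x = as.integer(tok[7]), y = as.integer(tok[8]),
      height = as.integer(tok[10]), origin = tok[12],
      ancestry = as.integer(tok[14])
    ),
    setNames(as.list(loci), paste0("locus_", 1:6))
  )
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Parse one population: split into records, parse each, bind once
parse_population <- function(input) {
  records <- strsplit(input, "] [", fixed = TRUE)[[1]]
  do.call(rbind, lapply(records, parse_record))
}

# Stand-in for the real list: one population with two records from the sample
genots_list <- list(paste0(
  '[["tree:" 3 "bromeliad:" 326 "local:" "canopy" 698 221 "height:" 5 ',
  '"origin:" "seed" "ancestry:" [34] "alleles:" ',
  '167 167 169 169 208 208 267 267 268 268 233 233] ',
  '["tree:" 3 "bromeliad:" 538 "local:" "canopy" 748 187 "height:" 8 ',
  '"origin:" "seed" "ancestry:" [34] "alleles:" ',
  '167 167 169 169 214 214 267 267 268 268 233 233]]'
))

pop_list <- lapply(genots_list, parse_population)
pop_list[[1]]
```

Keeping the loci as strings avoids any surprise if an allele pair ever exceeds the integer range; wrap them in `as.integer()` if numeric columns are preferred.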

How do I create a data.frame from a vector containing multiple strings in R?
