Creating a data frame from nested entries

Time: 2020-10-20 17:43:25

Tags: r

I have a data frame, test, that looks like this:

dput(test)
structure(list(X = 1L, entityId = structure(1L, .Label = "HOST-123", class = "factor"), 
    displayName = structure(1L, .Label = "server1", class = "factor"), 
    discoveredName = structure(1L, .Label = "server1", class = "factor"), 
    firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000, 
    tags = structure(1L, .Label = "c(\"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\", \"CONTEXTLESS\"), c(\"app1\", \"client\", \"org\", \"app1\", \"DATA_CENTER\", \"PURPOSE\", \"REGION\", \"Test\"), c(NA, \"NONE\", \"Host:Environment:test123\", \"111\", \"222\", \"GENERAL\", \"444\", \"555\")", class = "factor")), .Names = c("X", 
"entityId", "displayName", "discoveredName", "firstSeenTimestamp", 
"lastSeenTimestamp", "tags"), class = "data.frame", row.names = c(NA, 
-1L))

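The column called tags should itself become a data frame. I need to get rid of the first row in tags (it always says CONTEXTLESS), expand the second row in tags so that its values become columns, and finally insert the values from the third row in tags under each of the expanded columns.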

For example, it needs to look like this:

structure(list(entityId = structure(1L, .Label = "HOST-123", class = "factor"), 
    displayName = structure(1L, .Label = "server1", class = "factor"), 
    discoveredName = structure(1L, .Label = "server1", class = "factor"), 
    firstSeenTimestamp = 1593860000000, lastSeenTimestamp = 1603210000000, 
    app1 = NA, client = structure(1L, .Label = "None", class = "factor"), 
    org = structure(1L, .Label = "Host:Environment:test123", class = "factor"), 
    app1.1 = 111L, data_center = 222L, purppose = structure(1L, .Label = "general", class = "factor"), 
    region = 444L, test = 555L), .Names = c("entityId", "displayName", 
"discoveredName", "firstSeenTimestamp", "lastSeenTimestamp", 
"app1", "client", "org", "app1.1", "data_center", "purppose", 
"region", "test"), class = "data.frame", row.names = c(NA, -1L
))

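I need to drop the first vector, which always says "CONTEXTLESS", and turn the second vector into columns. Each value in the second vector should become a column name. The last vector should supply the values for the newly added columns.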

2 answers:

Answer 0: (score: 1)

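If you are willing to throw away the first "line" as junk and then look carefully at the side effects of the parsing, this may be a decent starting point: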

read.table(text=gsub("\\),", ")\n", test$tags[1]), sep=",", skip=1, #drops line
                      header=TRUE)

  c.app1 client                       org app1 DATA_CENTER  PURPOSE REGION Test.
1   c(NA   NONE  Host:Environment:test123  111         222  GENERAL    444  555)

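The read.table function uses the scan function, which does not know that "c(" and ")" are meaningful. Another option would be to try eval(parse(text= .)) on the second and third lines (it would know that they contain vectors), but I could not see a clean way to do that. Initially I tried separating the lines with strsplit, but that caused me to lose the parentheses.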
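One way the eval(parse(text= .)) idea might be made workable is to wrap the whole string in list(...), so that it parses as a single expression yielding three vectors. The sketch below is only an illustration under that assumption: the helper name tags_to_row is hypothetical and not part of the answer, it assumes exactly three equal-length vectors per entry, and eval(parse()) runs arbitrary code, so it should only be used on trusted input.

```r
# The tags string from the question, held as a plain character value.
tags <- 'c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), c("app1", "client", "org", "app1", "DATA_CENTER", "PURPOSE", "REGION", "Test"), c(NA, "NONE", "Host:Environment:test123", "111", "222", "GENERAL", "444", "555")'

# Hypothetical helper (not from the answer): wrap the text in list(...) so it
# parses as one expression that evaluates to a list of three vectors.
tags_to_row <- function(txt) {
  vecs <- eval(parse(text = paste0("list(", as.character(txt), ")")))
  stopifnot(length(vecs) == 3L, length(vecs[[2]]) == length(vecs[[3]]))
  values <- as.list(vecs[[3]])             # third vector: the cell values
  names(values) <- make.unique(vecs[[2]])  # second vector: the column names
  as.data.frame(values, stringsAsFactors = FALSE)  # first vector is ignored
}

tag_row <- tags_to_row(tags)
names(tag_row)
```

The resulting one-row data frame could then be combined with the remaining columns of test using cbind(). All values stay character at this point.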

I took the cleanup a step further by adding a few more gsub operations:

read.table(text=gsub("c\\(|\\)","", # gets rid of enclosing "c(" and ")"
                     gsub("\\),", "\n", # inserts line breaks
                                 test$tags[1])), 
                                 sep=",",     #lets commas be parsed
                                 skip=1,      #drops line
                                 header=TRUE) # converts to colnames

  app1 client                       org app1.1 DATA_CENTER  PURPOSE REGION Test
1   NA   NONE  Host:Environment:test123    111         222  GENERAL    444  555

The reason ".1" was appended to the second instance of app1 is that column names in an R data frame must be unique, unless you override that with check.names=FALSE.
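A small sketch of that behaviour, using a made-up two-column input rather than the question's data:

```r
# Made-up input with a duplicated header, for illustration only.
txt <- "app1,app1\n1,2"

# Default (check.names = TRUE): duplicate names are made unique.
checked <- read.table(text = txt, sep = ",", header = TRUE)
names(checked)    # "app1" "app1.1"

# check.names = FALSE keeps the header exactly as given.
unchecked <- read.table(text = txt, sep = ",", header = TRUE, check.names = FALSE)
names(unchecked)  # "app1" "app1"
```

Keeping duplicate names makes later selection by name ambiguous, since unchecked$app1 only returns the first match, so the default renaming is usually the safer choice.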

Answer 1: (score: 1)

Here is a tidyverse approach:

library(dplyr)
library(tidyr)

str2dataframe <- function(txt, keep = "all") {
  # If you can confirm that all vectors are of the same length, then we can make them into columns of a data.frame
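  # stringsAsFactors = FALSE keeps strings as character on R < 4.0, which make.unique() below requires
  out <- eval(parse(text = paste0("data.frame(", as.character(txt), ", stringsAsFactors = FALSE)")))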
  # rename columns as X1, X2, ...
  nms <- make.names(seq_along(out), unique = TRUE)
  if (keep == "all")
    keep <- nms
  `names<-`(out, nms)[, keep]
}

df %>% 
  mutate(
    tags = lapply(tags, str2dataframe, -1L), 
    tags = lapply(tags, function(d) within(d, X2 <- make.unique(X2)))
  ) %>% 
  unnest(tags) %>% 
  pivot_wider(names_from = "X2", values_from = "X3")

df looks like this:

> df
  X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp
1 1 HOST-123     server1        server1        1.59386e+12       1.60321e+12
                                                                                                                                                                                                                                                                                         tags
1 c("CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS", "CONTEXTLESS"), c("app1", "client", "org", "app1", "DATA_CENTER", "PURPOSE", "REGION", "Test"), c(NA, "NONE", "Host:Environment:test123", "111", "222", "GENERAL", "444", "555")

The output looks like this:

# A tibble: 1 x 14
      X entityId displayName discoveredName firstSeenTimestamp lastSeenTimestamp app1  client org                      app1.1 DATA_CENTER PURPOSE REGION Test 
  <int> <fct>    <fct>       <fct>                       <dbl>             <dbl> <chr> <chr>  <chr>                    <chr>  <chr>       <chr>   <chr>  <chr>
1     1 HOST-123 server1     server1             1593860000000     1603210000000 NA    NONE   Host:Environment:test123 111    222         GENERAL 444    555  
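The tag columns above all come out as character, whereas the desired output in the question stores values such as 111 and 222 as integers. One possible follow-up, sketched here on a small stand-in data frame rather than the real result, is to let base R's type.convert() pick a type for each column. The tag_cols object is made up for illustration.

```r
# Stand-in for a few of the character tag columns produced above.
tag_cols <- data.frame(
  app1 = NA_character_, client = "NONE", app1.1 = "111", PURPOSE = "GENERAL",
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps non-numeric strings as character instead of factor.
converted <- type.convert(tag_cols, as.is = TRUE)
sapply(converted, class)
```

Columns that contain only digits become integer, columns with other text stay character, and a column that is entirely NA becomes logical.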