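Unlisting a list vector and creating a data frame from an existing data frame

Time: 2016-02-01 18:25:20

Tags: r dataframe

I am trying to create a new data frame from an existing data frame. The existing data frame (df) has the following format: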

   A           B                                    C
   london   c("Kompast", "Kirklan", "Com")    c("April 1989- June 1990", "July 1990-May 2000", "May 2000-July 2012")
   sydney   c("kkj", "krr")                   c("April 1990-May 2000", "May 2000-March 2012")
   newyork  Coml                              c("April 1990- May 2013", "2 years")
   chicago   NULL                              NULL

I need to unlist the list columns of the data frame so that each list item becomes its own row, as shown below:

A        B             C
london  Kompast April 1989- June 1990
london  Kirklan July 1990-May 2000
london  Com     May 2000-July 2012

Any suggestions?

1 Answer:

Answer 0: (score: 1)

As mentioned in the comments, you can look at the flatten and flattenLong functions, which are currently in this GitHub Gist (and are recreated below).

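First, here is some sample data. df_1 has a balanced number of items in each list in columns "B" and "C", plus one item that is NULL. df_2, on the other hand, has an unbalanced number of items in each list column, with NULLs interspersed.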

df_1 <- data.frame(
  A = c("london", "sydney", "new york", "chicago"),
  B = I(list(letters[1:3], letters[4:5], letters[6], NULL)),
  C = I(list(LETTERS[1:3], LETTERS[4:5], LETTERS[6], NULL))
)
df_1
#          A       B       C
# 1   london a, b, c A, B, C
# 2   sydney    d, e    D, E
# 3 new york       f       F
# 4  chicago                

df_2 <- data.frame(
  A = c("london", "sydney", "new york", "chicago"),
  B = I(list(letters[1:3], letters[4:5], letters[6], NULL)),
  C = I(list(LETTERS[1:2], NULL, LETTERS[3:5], LETTERS[6:7]))
)
df_2
#          A       B       C
# 1   london a, b, c    A, B
# 2   sydney    d, e        
# 3 new york       f C, D, E
# 4  chicago            F, G

Here are the relevant functions from the Gist:

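flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  # Work on a data.table; an input that is already a data.table is modified by reference
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  # Length of the longest list item in each list column
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  # New column names such as B_1, B_2, B_3, C_1, ...
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  # transpose() turns each list column into one NA-padded vector per item position
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = (cols)]
  # Optionally remove the original list columns
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}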

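flattenLong <- function(indt, cols) {
  # Remaining (non-list) columns, used to sort the result
  ob <- setdiff(names(indt), cols)
  # Wide form, with the original list columns dropped
  x <- flatten(indt, cols, TRUE)
  # Positions of the B_*, C_*, ... columns, one group per list column
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  # Melt each group into a single column, then sort by the remaining columns
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}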

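Finally, here is the usage. Note that you can work out the expected number of rows by taking the length of the longest list item and multiplying it by the number of existing rows. In this case, 3 x 4 = 12 rows.

Here it is with df_1: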

flattenLong(df_1, c("B", "C"))
#            A variable  B  C
#  1:  chicago        1 NA NA
#  2:  chicago        2 NA NA
#  3:  chicago        3 NA NA
#  4:   london        1  a  A
#  5:   london        2  b  B
#  6:   london        3  c  C
#  7: new york        1  f  F
#  8: new york        2 NA NA
#  9: new york        3 NA NA
# 10:   sydney        1  d  D
# 11:   sydney        2  e  E
# 12:   sydney        3 NA NA

Here it is with df_2:

flattenLong(df_2, c("B", "C"))
#            A variable  B  C
#  1:  chicago        1 NA  F
#  2:  chicago        2 NA  G
#  3:  chicago        3 NA NA
#  4:   london        1  a  A
#  5:   london        2  b  B
#  6:   london        3  c NA
#  7: new york        1  f  C
#  8: new york        2 NA  D
#  9: new york        3 NA  E
# 10:   sydney        1  d NA
# 11:   sydney        2  e NA
# 12:   sydney        3 NA NA
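
The row-count arithmetic mentioned above (longest list item times number of rows) can be checked in base R before calling flattenLong. This is only a small sketch: it rebuilds df_1 so that it runs on its own, and the names max_len and expected_rows are introduced here purely for illustration.

```r
# Rebuild the sample data from the answer so the snippet is self-contained
df_1 <- data.frame(
  A = c("london", "sydney", "new york", "chicago"),
  B = I(list(letters[1:3], letters[4:5], letters[6], NULL)),
  C = I(list(LETTERS[1:3], LETTERS[4:5], LETTERS[6], NULL))
)

# Longest list item across the list columns B and C (NULL counts as length 0)
max_len <- max(lengths(df_1$B), lengths(df_1$C))

# Every original row is expanded to max_len rows in the long result
expected_rows <- max_len * nrow(df_1)
expected_rows
# [1] 12
```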

And, as a bonus, if you prefer a "wide" format, you can use flatten directly (it is called by flattenLong, as shown in the function code).

flatten(df_1, c("B", "C"))
#           A     B     C B_1 B_2 B_3 C_1 C_2 C_3
# 1:   london a,b,c A,B,C   a   b   c   A   B   C
# 2:   sydney   d,e   D,E   d   e  NA   D   E  NA
# 3: new york     f     F   f  NA  NA   F  NA  NA
# 4:  chicago  NULL  NULL  NA  NA  NA  NA  NA  NA
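
The output in the question has one row per actual list item, without rows that consist only of NA padding. A possible base R sketch of that variant follows. It is not part of the Gist: unnest_rows is a hypothetical helper name, it assumes the list items are character vectors, and it pads only up to the longest item within each row, so a row whose list items are all NULL (chicago in df_1) is dropped.

```r
# Hypothetical helper (not from the Gist): one output row per list item,
# padded with NA only up to the longest item within the same original row
unnest_rows <- function(df, cols) {
  # Rows needed for each original row: longest list item across `cols`
  n <- do.call(pmax, unname(lapply(df[cols], lengths)))
  # Pad every list item of a column to n[i] with NA, then unlist into one vector
  # (assumes character items, hence NA_character_)
  pad <- function(col) {
    unlist(Map(function(x, k) c(x, rep(NA_character_, k - length(x))),
               unclass(col), n))
  }
  keep <- setdiff(names(df), cols)
  # Repeat each original row n[i] times, keeping only the non-list columns
  out <- df[rep(seq_len(nrow(df)), n), keep, drop = FALSE]
  for (col in cols) out[[col]] <- pad(df[[col]])
  rownames(out) <- NULL
  out
}

# Sample data from the answer, rebuilt so the snippet runs on its own
df_1 <- data.frame(
  A = c("london", "sydney", "new york", "chicago"),
  B = I(list(letters[1:3], letters[4:5], letters[6], NULL)),
  C = I(list(LETTERS[1:3], LETTERS[4:5], LETTERS[6], NULL)),
  stringsAsFactors = FALSE
)

res <- unnest_rows(df_1, c("B", "C"))
res
```

With df_1 this gives six rows (three for london, two for sydney, one for new york). If every original row should stay in the result even when its list items are NULL, flattenLong is the better fit.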