How to replicate rows based on characters in strings from multiple columns

Date: 2017-06-16 23:34:47

Tags: r dataframe split

I have a data frame like the one below, with comma-separated values in columns x and y:

df <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))

df
  var1 var2 var3                  x               y
1    a    f    1              apple            wine
2    b    g    2       orange,apple            wine
3    c    h    3              grape           juice
4    d    i    4 apple,orange,grape wine,beer,juice
5    e    j    5       cherry,peach      beer,juice

What is the simplest way to make it look like this:

dfnew                   
    var1    var2    var3    x       y
    a       f       1       apple   wine
    b       g       2       orange  wine
    b       g       2       apple   NA
    c       h       3       grape   juice
    d       i       4       apple   wine
    d       i       4       orange  beer
    d       i       4       grape   juice
    e       j       5       cherry  beer
    e       j       5       peach   juice

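I've seen similar questions, but while my example uses 3 columns, my real data has many more. I need something that takes all columns except x & y, replicates them, and splits x and y on "," into table form, as in my desired result.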

2 Answers:

Answer 0 (score: 3)

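In the data.frame as originally posted, there is a 1:1 relationship between the list elements in x and those in y within the same row. So, after splitting, x and y contain the same number of elements in every row. This "symmetric" structure allows us to split both columns simultaneously: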

# original data.frame, "symmetric" data
df1 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, 
                  x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), 
                  y=c('wine', 'wine,beer', 'juice', 'wine,beer,juice', 'beer,juice'))

library(data.table)   # CRAN version 1.10.4 used
# define columns to be split
sp_col <- c("x", "y")
# define id columns
id_col <- paste0("var", 1:3)
# coerce to class data.table, 
# convert sp_col from factor to character which is required by strsplit(),
# then split up all columns _not_ used for grouping,
# turn the result into vectors, but for each column separately. 
setDT(df1)[, (sp_col) := lapply(.SD, as.character), .SDcols = sp_col][
  , unlist(lapply(.SD, strsplit, split = ",", fixed = TRUE), recursive = FALSE), by = id_col]

which yields

   var1 var2 var3      x     y
1:    a    f    1  apple  wine
2:    b    g    2 orange  wine
3:    b    g    2  apple  beer
4:    c    h    3  grape juice
5:    d    i    4  apple  wine
6:    d    i    4 orange  beer
7:    d    i    4  grape juice
8:    e    j    5 cherry  beer
9:    e    j    5  peach juice
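
Whether x and y are "symmetric" in this sense can be checked before picking an approach. A minimal base R sketch, assuming the delimiter is a plain comma; the names nx, ny, symmetric, y_edited and mismatch are only illustrative and do not come from the answer:

```r
x <- c("apple", "orange,apple", "grape", "apple,orange,grape", "cherry,peach")
y <- c("wine", "wine,beer", "juice", "wine,beer,juice", "beer,juice")

# number of comma-separated elements per row
nx <- lengths(strsplit(x, ",", fixed = TRUE))
ny <- lengths(strsplit(y, ",", fixed = TRUE))

# TRUE means every row has as many elements in x as in y ("symmetric")
symmetric <- all(nx == ny)

# with the OP's edited y, the element counts differ in one row
y_edited <- c("wine", "wine", "juice", "wine,beer,juice", "beer,juice")
mismatch <- which(nx != lengths(strsplit(y_edited, ",", fixed = TRUE)))
```

Rows listed in mismatch have no 1:1 pairing between x and y, so the simultaneous split above does not apply to them as is.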

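Edit: With the edited data.frame, the OP has requested that missing positions be filled with NA, which requires a different approach. For this, melt() and dcast() are used: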

# data.frame updated by OP, "unsymmetric" data
df2 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, 
                  x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), 
                  y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))

Note the change in row 2 of column y.

library(data.table)   # CRAN version 1.10.4 used
# define columns to be split
sp_col <- c("x", "y")
# coerce to class data.table, add column with row numbers
# reshape from wide to long format
long <- melt(setDT(df2)[, rn := .I], measure.vars = sp_col)
# split value column, grouped by all other columns
# reshape from long to wide format where the rows are formed by
# an individual count by row number and variable + all other id cols,
# finally remove the row numbers as they are no longer needed
dcast(long[, strsplit(value, ",", fixed = TRUE), by = setdiff(names(long), "value")], 
      ... + rowid(rn, variable) ~ variable , value.var = "V1")[
        , rn := NULL][]

(Thanks to @Jaap for suggesting improvements)

which produces the requested NAs:

   var1 var2 var3      x     y
1:    a    f    1  apple  wine
2:    b    g    2 orange  wine
3:    b    g    2  apple    NA
4:    c    h    3  grape juice
5:    d    i    4  apple  wine
6:    d    i    4 orange  beer
7:    d    i    4  grape juice
8:    e    j    5 cherry  beer
9:    e    j    5  peach juice

Answer 1 (score: 2)

A solution in base R:

# split the 'x' & 'y' columns in lists
xl <- strsplit(as.character(df$x), ',')
yl <- strsplit(as.character(df$y), ',')

# get the maximum length of the strings for each row
reps <- pmax(lengths(xl), lengths(yl))

# replicate the rows of 'df' by the vector of maximum string lengths
df2 <- df[rep(1:nrow(df), reps), 1:3]

# add NA-values for when the length of the strings in 'df' is shorter than
# the maximum length (which is stored in the 'reps'-vector)
# unlist & add to 'df2'
df2$x <- unlist(mapply(function(x,y) c(x, rep(NA, y)), xl, reps - lengths(xl)))
df2$y <- unlist(mapply(function(x,y) c(x, rep(NA, y)), yl, reps - lengths(yl)))

which gives:

> df2
    var1 var2 var3      x     y
1      a    f    1  apple  wine
2      b    g    2 orange  wine
2.1    b    g    2  apple  <NA>
3      c    h    3  grape juice
4      d    i    4  apple  wine
4.1    d    i    4 orange  beer
4.2    d    i    4  grape juice
5      e    j    5 cherry  beer
5.1    e    j    5  peach juice
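
Since the real data has many columns, the same base R steps can be wrapped so that the columns to split are passed by name and every other column is replicated. This is only a sketch under assumptions: split_rows() is a hypothetical helper, not part of either answer or of any package, and it expects a plain data.frame (not a data.table) and a literal separator.

```r
# Hypothetical helper: split several delimited columns in parallel,
# padding the shorter lists with NA and replicating all other columns.
split_rows <- function(data, cols, sep = ",") {
  # split each target column into a list of character vectors, one per row
  parts <- lapply(data[cols], function(col) {
    strsplit(as.character(col), sep, fixed = TRUE)
  })
  # per row, the number of output rows is the longest split among 'cols'
  reps <- do.call(pmax, unname(lapply(parts, lengths)))
  # replicate the remaining columns accordingly
  keep <- setdiff(names(data), cols)
  out <- data[rep(seq_len(nrow(data)), reps), keep, drop = FALSE]
  # pad each split to 'reps' elements with NA, then flatten into a column
  for (nm in cols) {
    out[[nm]] <- unlist(Map(function(p, n) c(p, rep(NA_character_, n - length(p))),
                            parts[[nm]], reps))
  }
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = letters[1:5], var2 = letters[6:10], var3 = 1:5,
                 x = c("apple", "orange,apple", "grape", "apple,orange,grape", "cherry,peach"),
                 y = c("wine", "wine", "juice", "wine,beer,juice", "beer,juice"),
                 stringsAsFactors = FALSE)

res <- split_rows(df, c("x", "y"))
```

Here res should have the same nine rows as df2 above, with row names reset to 1 to 9. Rows in which every split column is an empty string get a count of 0 and are dropped, which may or may not be what you want.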