I have a data frame like the one below, in which columns x and y contain comma-separated values:
df <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5, x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'), y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
df
var1 var2 var3 x y
1 a f 1 apple wine
2 b g 2 orange,apple wine
3 c h 3 grape juice
4 d i 4 apple,orange,grape wine,beer,juice
5 e j 5 cherry,peach beer,juice
What is the simplest way to make it look like this:
dfnew
var1 var2 var3 x y
a f 1 apple wine
b g 2 orange wine
b g 2 apple NA
c h 3 grape juice
d i 4 apple wine
d i 4 orange beer
d i 4 grape juice
e j 5 cherry beer
e j 5 peach juice
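I have seen similar questions, but my example has only 3 other columns, whereas my real data has many. I need something that takes all columns except x & y and replicates them, then splits the "," separated values into tabular form, as in my expected result.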
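Answer 0 (score: 3)
In the original data.frame, there is a 1:1 relationship between the list elements in x and those in y in the same row. So, after splitting, x and y contain the same number of elements in every row. This "symmetric" structure allows us to split both columns simultaneously: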
# original data.frame, "symmetric" data
df1 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5,
x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'),
y=c('wine', 'wine,beer', 'juice', 'wine,beer,juice', 'beer,juice'))
library(data.table) # CRAN version 1.10.4 used
# define columns to be split
sp_col <- c("x", "y")
# define id columns
id_col <- paste0("var", 1:3)
# coerce to class data.table,
# convert sp_col from factor to character which is required by strsplit(),
# then split up all columns _not_ used for grouping,
# turn the result into vectors, but for each column separately.
setDT(df1)[, (sp_col) := lapply(.SD, as.character), .SDcols = sp_col][
, unlist(lapply(.SD, strsplit, split = ",", fixed = TRUE), recursive = FALSE), by = id_col]
which yields
var1 var2 var3 x y
1: a f 1 apple wine
2: b g 2 orange wine
3: b g 2 apple beer
4: c h 3 grape juice
5: d i 4 apple wine
6: d i 4 orange beer
7: d i 4 grape juice
8: e j 5 cherry beer
9: e j 5 peach juice
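The simultaneous split relies on x and y having the same number of elements in every row. A minimal sketch of how that assumption could be checked in base R before splitting; the small vectors and the names `n_x`, `n_y` and `mismatch` are illustrative and not part of the answer:

```r
# Illustrative data: row 2 of `y_asym` has fewer elements than row 2 of `x`
x      <- c("apple", "orange,apple", "grape")
y_sym  <- c("wine", "wine,beer", "juice")
y_asym <- c("wine", "wine", "juice")

# Number of comma-separated elements per row
n_x    <- lengths(strsplit(x, ",", fixed = TRUE))
n_sym  <- lengths(strsplit(y_sym, ",", fixed = TRUE))
n_asym <- lengths(strsplit(y_asym, ",", fixed = TRUE))

# TRUE when every row has matching counts, i.e. the data are "symmetric"
is_symmetric <- all(n_x == n_sym)

# Row numbers where the counts differ in the unsymmetric case
mismatch <- which(n_x != n_asym)
```

If `mismatch` is non-empty, the rows it lists would need padding with NA, which is the case handled by the edit that follows.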
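Edit: With the edited data.frame, the OP has requested that missing positions be filled with NA, which requires a different approach. For this, melt() and dcast() are used.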
# data.frame updated by OP, "unsymmetric" data
df2 <- data.frame(var1=letters[1:5], var2=letters[6:10], var3=1:5,
x=c('apple','orange,apple', 'grape','apple,orange,grape','cherry,peach'),
y=c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
Note the change in row 2 of column y.
library(data.table) # CRAN version 1.10.4 used
# define columns to be split
sp_col <- c("x", "y")
# coerce to class data.table, add column with row numbers
# reshape from wide to long format
long <- melt(setDT(df2)[, rn := .I], measure.vars = sp_col)
# split value column, grouped by all other columns
# reshape from long to wide format where the rows are formed by
# an individual count by row number and variable + all other id cols,
# finally remove the row numbers as this is no longer needed
dcast(long[, strsplit(value, ",", fixed = TRUE), by = setdiff(names(long), "value")],
... + rowid(rn, variable) ~ variable , value.var = "V1")[
, rn := NULL][]
(Thanks to @Jaap for suggesting an improvement.)
This produces the requested NAs:
var1 var2 var3 x y
1: a f 1 apple wine
2: b g 2 orange wine
3: b g 2 apple NA
4: c h 3 grape juice
5: d i 4 apple wine
6: d i 4 orange beer
7: d i 4 grape juice
8: e j 5 cherry beer
9: e j 5 peach juice
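The key step in the dcast() formula is the running count within each combination of row number and variable, which lines up the first element of x with the first element of y, the second with the second, and so on. A rough base R sketch of that counting idea, using ave() as a stand-in for data.table's rowid(); the vectors below are made up to mimic original row 2 after splitting:

```r
# Long-format pieces for one original row (rn = 2):
# x was split into two elements, y into only one
rn       <- c(2, 2, 2)
variable <- c("x", "x", "y")
value    <- c("orange", "apple", "wine")

# Running count within each (rn, variable) group
pos <- ave(seq_along(rn), rn, variable, FUN = seq_along)

# Position 2 exists for x but not for y, so reshaping to wide format
# leaves an NA in y for that position
data.frame(rn, variable, value, pos)
```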
Answer 1 (score: 2)
A base R solution:
# split the 'x' & 'y' columns in lists
xl <- strsplit(as.character(df$x), ',')
yl <- strsplit(as.character(df$y), ',')
# get the maximum length of the strings for each row
reps <- pmax(lengths(xl), lengths(yl))
# replicate the rows of 'df' by the vector of maximum string lengths
df2 <- df[rep(1:nrow(df), reps), 1:3]
# add NA-values for when the length of the strings in 'df' is shorter than
# the maximum length (which is stored in the 'reps'-vector)
# unlist & add to 'df2'
df2$x <- unlist(mapply(function(x,y) c(x, rep(NA, y)), xl, reps - lengths(xl)))
df2$y <- unlist(mapply(function(x,y) c(x, rep(NA, y)), yl, reps - lengths(yl)))
which gives:
> df2
var1 var2 var3 x y
1 a f 1 apple wine
2 b g 2 orange wine
2.1 b g 2 apple <NA>
3 c h 3 grape juice
4 d i 4 apple wine
4.1 d i 4 orange beer
4.2 d i 4 grape juice
5 e j 5 cherry beer
5.1 e j 5 peach juice
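Since the question mentions that the real data has many columns, the same base R idea can be wrapped so that it does not hard-code columns 1:3 or the names x and y. This is only a sketch: the function `split_rows()` and its arguments are hypothetical and not part of the answer above.

```r
# Hypothetical helper: split the columns named in `cols` on `sep`,
# pad shorter pieces with NA and replicate all remaining columns.
split_rows <- function(df, cols, sep = ",") {
  # one list of character vectors per column to be split
  pieces <- lapply(df[cols], function(col) {
    strsplit(as.character(col), sep, fixed = TRUE)
  })
  # per original row, the largest number of pieces across `cols`
  reps <- do.call(pmax, unname(lapply(pieces, lengths)))
  # replicate the columns that are not split
  out <- df[rep(seq_len(nrow(df)), reps), setdiff(names(df), cols), drop = FALSE]
  # pad every piece vector to the row maximum (extends with NA), then flatten
  for (nm in cols) {
    out[[nm]] <- unlist(Map(function(p, n) { length(p) <- n; p },
                            pieces[[nm]], reps))
  }
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = letters[1:5], var2 = letters[6:10], var3 = 1:5,
                 x = c('apple', 'orange,apple', 'grape', 'apple,orange,grape', 'cherry,peach'),
                 y = c('wine', 'wine', 'juice', 'wine,beer,juice', 'beer,juice'))
res <- split_rows(df, cols = c("x", "y"))
```

One caveat of this sketch: a row whose split columns are all empty strings yields zero pieces and is dropped, so such rows would need separate handling.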