How to convert multiple columns into observations

Time: 2015-09-18 15:09:32

Tags: r dataframe reshape reshape2 tidyr

I have a data frame like this:

structure(list(one = structure(1:4, .Label = c("a", "b", "c", 
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a", 
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a", 
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA, 
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one", 
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA, 
-4L), class = "data.frame")

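As you can see, the observations for each event a, b, c and d (variable "one") are stored as columns, where x and y identify the individual observations and 1, 2 and 3 identify the variables. The variable "two" has no meaning here.

I would like to reshape this data frame into a tidy form, where each observation has its own row and each variable has its own column.

The final data frame should look like this: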

structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a", 
"b", "c", "d"), class = "factor"), two = c(2, 4, 2, 7, 5, 3), 
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a", 
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b", 
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1", 
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")

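I am somewhat familiar with the cast and melt functions of the reshape package, but I have not found a smart way to reshape the DF. The following is as far as I have got: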

df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)

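Now the "variable" column correctly identifies the variables (1, 2 or 3). However, I cannot cast it into the desired format, and because of the gsub call this solution does not seem suitable for larger data sets.

I would be glad for a nudge in the right direction.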

3 Answers:

Answer 0: (score: 5)

We can use melt from the devel version of data.table, i.e. v1.9.5, which can take multiple patterns for the measure variables:

library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
      na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
#   one two var1 var2 var3
#1:   a   2  x1a  x2a  x3a
#2:   b   4  x1b  x2b  x3b
#3:   b   4  y1b  y2b  y3b
#4:   c   7  x1c  x2c  x3c
#5:   c   7  y1c  y2c  y3c
#6:   d   3  x1d  x2d  x3d

Edit: We don't need c inside patterns, and it also gives an exact match (from @Jaap's comment).
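Note that in the call above, '.1' is treated as a regular expression, so the dot matches any character and could also pick up a column such as x11 in wider data. The following is a minimal sketch of the same melt with escaped, anchored patterns, assuming a data.table release that ships patterns(). The object names df.in, dt and long are illustrative; df.in is rebuilt from the question's data, and as.data.table is used instead of setDT so the original data frame is not modified by reference.

```r
library(data.table)

# Input data rebuilt from the question (named df.in there)
df.in <- data.frame(
  one = factor(c("a", "b", "c", "d")),
  two = c(2, 4, 7, 3),
  x.1 = c("x1a", "x1b", "x1c", "x1d"),
  x.2 = c("x2a", "x2b", "x2c", "x2d"),
  x.3 = c("x3a", "x3b", "x3c", "x3d"),
  y.1 = c(NA, "y1b", "y1c", NA),
  y.2 = c(NA, "y2b", "y2c", NA),
  y.3 = c(NA, "y3b", "y3c", NA),
  stringsAsFactors = FALSE
)

# Work on a copy so df.in itself stays a plain data.frame
dt <- as.data.table(df.in)

# One pattern per output column; "\\." matches a literal dot and
# "$" anchors the digit at the end of the column name
long <- melt(dt,
             measure.vars = patterns("\\.1$", "\\.2$", "\\.3$"),
             value.name   = paste0("var", 1:3),
             na.rm        = TRUE)

long[, variable := NULL]  # drop the x/y index column
setorder(long, one)       # sort by event
print(long)
```

If the x/y origin of each row matters, keep the variable column instead of dropping it; it indexes the matched columns in their original order (1 for x, 2 for y here).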

Answer 1: (score: 3)

melt from "data.table" will be faster than the following, but you can also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
                     sep = "var.stubs", atStart = FALSE))

#    one two .time_1  .1  .2  .3
# 1:   a   2       x x1a x2a x3a
# 2:   b   4       x x1b x2b x3b
# 3:   b   4       y y1b y2b y3b
# 4:   c   7       x x1c x2c x3c
# 5:   c   7       y y1c y2c y3c
# 6:   d   3       x x1d x2d x3d

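Answer 2: (score: 2)

You were almost there with the reshape route, so I finished it for you. All you need is to distinguish the x and y variables. (If you don't want or need them, they are easy to drop later.) I left the missing values in, because they are easy to remove and this prevents silently dropping missing data.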

library(reshape2)
df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)

res <- dcast(one+two+variable_xy~variable_n,value.var="value",data=df.between)

> res
  one two variable_xy var1 var2 var3
1   a   2           x  x1a  x2a  x3a
2   a   2           y <NA> <NA> <NA>
3   b   4           x  x1b  x2b  x3b
4   b   4           y  y1b  y2b  y3b
5   c   7           x  x1c  x2c  x3c
6   c   7           y  y1c  y2c  y3c
7   d   3           x  x1d  x2d  x3d
8   d   3           y <NA> <NA> <NA>
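To get from this result to the layout requested in the question, the all-NA rows and the helper column can be dropped afterwards. The following is a minimal sketch of that cleanup, assuming reshape2 is installed. It rebuilds res from the question's data so it runs on its own, and it uses escaped regular expressions in place of "x.|y." and ".[0-9]" so the dot only matches a literal dot. The names value_cols, keep and tidy are illustrative.

```r
library(reshape2)

# Input data rebuilt from the question
df.in <- data.frame(
  one = factor(c("a", "b", "c", "d")),
  two = c(2, 4, 7, 3),
  x.1 = c("x1a", "x1b", "x1c", "x1d"),
  x.2 = c("x2a", "x2b", "x2c", "x2d"),
  x.3 = c("x3a", "x3b", "x3c", "x3d"),
  y.1 = c(NA, "y1b", "y1c", NA),
  y.2 = c(NA, "y2b", "y2c", NA),
  y.3 = c(NA, "y3b", "y3c", NA),
  stringsAsFactors = FALSE
)

# Same melt/dcast route as above, with stricter patterns
df.between <- melt(df.in, id.vars = c("one", "two"))
df.between$variable_n  <- sub("^[xy]\\.", "var", df.between$variable)  # x.1 -> var1
df.between$variable_xy <- sub("\\.[0-9]+$", "", df.between$variable)   # x.1 -> x
res <- dcast(one + two + variable_xy ~ variable_n,
             value.var = "value", data = df.between)

# Keep only rows with at least one non-missing value,
# then drop the x/y helper column
value_cols <- c("var1", "var2", "var3")
keep <- rowSums(!is.na(res[value_cols])) > 0
tidy <- res[keep, c("one", "two", value_cols)]
rownames(tidy) <- NULL
print(tidy)
```

Filtering on "every value column is NA" rather than calling na.omit keeps any row that is only partly missing, which matches the intent of not dropping data silently.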