Question

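I'm trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly are all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or data.table solution would be fine with me, but the ideal solution would be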

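short, with no repeated code
comprehensible (somewhat in conflict with criterion #1)
involve as little hard-coding of column numbers, etc. as possible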
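
The "identical correlations" property can be checked directly from the wide data. This is a minimal base-R sketch; the name `set_cor` is only an illustrative choice, and the loop over 1:4 assumes the four-set layout of the built-in data.

```r
# Correlation between x_i and y_i for each of the four sets
set_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

# The four values agree to two decimal places
round(set_cor, 2)
```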

Original format:

 anscombe
 ##     x1 x2 x3 x4    y1   y2   y3     y4
 ##  1  10 10 10  8  8.04 9.14  7.46  6.58
 ##  2   8  8  8  8  6.95 8.14  6.77  5.76
 ##  3  13 13 13  8  7.58 8.74 12.74  7.71
 ## ...
 ## 11   5  5  5  8  5.68 4.74  5.73  6.89

Desired format:

 ##    s  x    y
 ## 1  1 10 8.04
 ## 2  1  8 6.95
 ## ...
 ## 44 4  8 6.89

Here's my attempt:

 library("reshape2")
 ff <- function(x,v) 
     setNames(transform(
        melt(as.matrix(x)),
             v1=substr(Var2,1,1),
             v2=substr(Var2,2,2))[,c(3,5)],
          c(v,"s"))
 f1 <- ff(anscombe[,1:4],"x")
 f2 <- ff(anscombe[,5:8],"y")
 f12 <- cbind(f1,f2)[,c("s","x","y")]

Now plot:

 library("ggplot2"); theme_set(theme_classic())
 ## panel.margin was renamed to panel.spacing in ggplot2 2.2.0
 th_clean <-
   theme(panel.spacing=grid::unit(0,"lines"),
     axis.ticks.x=element_blank(),
     axis.text.x=element_blank(),
     axis.ticks.y=element_blank(),
     axis.text.y=element_blank()
     )
 ggplot(f12,aes(x,y))+geom_point()+
   facet_wrap(~s)+labs(x="",y="")+
   th_clean

Answer 1

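If you're really dealing with the "anscombe" dataset, then I'd say @Thela's reshape solution is very direct.

However, here are a few other options to consider:

Option 1: Base R

You can write your own "reshape" function, perhaps something like this: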

myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}

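Notes:

I'm not sure this is necessarily any less clunky than what you're already doing.
This approach won't work if the data are "unbalanced" (for example, more "x" columns than "y" columns).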
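
To make the second note concrete, here is a small sketch that runs the function on both a balanced and an unbalanced input. The function body is copied from above so the block runs on its own; `anscombe2` is a hypothetical unbalanced input (x1 to x4 but only y1 to y3) used purely for illustration.

```r
# Copy of myReshape from above, so this sketch is self-contained
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}

# Balanced input: sapply() returns a 44 x 2 matrix, so this works
balanced <- myReshape()
str(balanced)

# Unbalanced input: the "x" and "y" stacks have different lengths,
# sapply() returns a list instead of a matrix, and data.frame() fails
anscombe2 <- anscombe[1:7]
res <- tryCatch(myReshape(anscombe2),
                error = function(e) conditionMessage(e))
res
```

Catching the error with tryCatch() is only there to let the sketch run to completion; in normal use the call would simply stop with that message.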

Option 2: "dplyr" + "tidyr"

Since pipes are all the rage these days, you can also try:

library(dplyr)
library(tidyr)

anscombe %>%
  gather(var, val, everything()) %>%
  extract(var, into = c("variable", "s"), "(.)(.)") %>% 
  group_by(variable, s) %>%
  mutate(ind = sequence(n())) %>%
  spread(variable, val)

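Notes:

I'm not sure this is any less clunky than what you're already doing, but some people like the pipe approach.
This approach should be able to handle unbalanced data.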

Option 3: "splitstackshape"

Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With it, the approach would be:

library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE), 
               var.stubs = c("x", "y"), sep = "var.stubs"), 
  ".time_1", "s")[]

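A few notes:

merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" containing the row numbers.
sep = "var.stubs" basically means that we don't really have a separator, so we just strip out the stub and use whatever remains as the "time" variable.
merged.stack will work with unbalanced data. For instance, try using anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
The same package also has a function called Reshape, which builds on reshape to let it reshape unbalanced data. It is slower but more flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = ""), followed by setnames to rename the "time" variable.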

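Option 4: `melt.data.table`

This was mentioned in the comments above but had not yet been shared as an answer. Apart from base R's reshape, this is a very direct approach, and it handles the column renaming within the function itself: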

library(data.table)
melt(as.data.table(anscombe), 
     measure.vars = patterns(c("x", "y")), 
     value.name=c('x', 'y'), 
     variable.name = "s")

Notes:

Will be very fast.
Faster than "splitstackshape" or reshape ;-)
Handles unbalanced data just fine.
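
As a rough check of the last note, the sketch below melts a hypothetical unbalanced input, `anscombe2` (x1 to x4 but only y1 to y3). It assumes that melt.data.table fills the missing "y" column with NA, which is how I understand its behavior when the measure groups have unequal lengths. The anchored patterns "^x" and "^y" are my own tightening of the patterns used above.

```r
library(data.table)

# Hypothetical unbalanced input: the y4 column is dropped
anscombe2 <- as.data.table(anscombe[1:7])

long2 <- melt(anscombe2,
              measure.vars = patterns("^x", "^y"),
              value.name = c("x", "y"),
              variable.name = "s")

# Rows for set 4 should still be present, with y missing
long2[s == 4]
```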

Answer 2

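I think this meets the criteria of being 1) short, 2) understandable, and 3) free of hard-coded column numbers. It also doesn't require any other packages.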

reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")

#     s  x     y id
#1.1  1 10  8.04  1
#...
#11.1 1  5  5.68 11
#1.2  2 10  9.14  1
#...
#11.2 2  5  4.74 11
#1.3  3 10  7.46  1
#...
#11.3 3  5  5.73 11
#1.4  4  8  6.58  1
#...
#11.4 4  8  6.89 11
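
With sep = "", reshape() guesses the variable names by splitting each column name just before the first numeral that follows a letter, so "x1" becomes the variable "x" at time 1. The result still carries an extra id column and compound row names. A minimal follow-up sketch, assuming the column order shown in the output above, trims it to the exact format requested in the question (`long` is just an illustrative name):

```r
long <- reshape(anscombe, varying = TRUE, sep = "",
                direction = "long", timevar = "s")

# Keep only the requested columns and drop the "1.1"-style row names
long <- long[c("s", "x", "y")]
rownames(long) <- NULL

head(long, 3)
```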

Answer 3

I don't know whether a non-reshape solution would be acceptable, but here you go:

library(data.table)
#create the pattern that will have the Xs
#this will make it easy to create the Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame, 
                       pattern,
                       anscombe[pattern], 
                       anscombe[pattern+length(pattern)]
                       )
                   )
#set the correct names
setnames(lists, names(lists), c('s','x','y'))

Output:

> lists
    s  x     y
 1: 1 10  8.04
 2: 1  8  6.95
 3: 1 13  7.58
 4: 1  9  8.81
 5: 1 11  8.33
 6: 1 14  9.96
 7: 1  6  7.24
 8: 1  4  4.26
 9: 1 12 10.84
10: 1  7  4.82
....

Answer 4

A more recent tidyverse option, as suggested in the tidyverse vignette:

library(dplyr)
library(tidyr)

anscombe %>% 
  pivot_longer(everything(), 
    names_to = c(".value", "set"), 
    names_pattern = "(.)(.)"
  ) %>% 
  arrange(set)
#> # A tibble: 44 x 3
#>    set       x     y
#>    <chr> <dbl> <dbl>
#>  1 1        10  8.04
#>  2 1         8  6.95
#>  3 1        13  7.58
#>  4 1         9  8.81
#>  5 1        11  8.33
#>  6 1        14  9.96
#>  7 1         6  7.24
#>  8 1         4  4.26
#>  9 1        12 10.8 
#> 10 1         7  4.82
#> # … with 34 more rows
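
The ".value" entry in names_to tells pivot_longer() that the first captured group ("x" or "y") names the output column, while the second group becomes the "set" column. As a sketch of how this result might feed the faceted plot from the question (the object names `long` and `p` are illustrative, and the theme tweaks from the question are left out):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

long <- anscombe %>%
  pivot_longer(everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  ) %>%
  arrange(set)

# One panel per data set; printing p renders the plot
p <- ggplot(long, aes(x, y)) +
  geom_point() +
  facet_wrap(~set)
```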

Less clunky reshaping of anscombe data
