Question

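I know this question is simple, but I can't find a solution that doesn't build intermediate objects step by step. I'd like a one-liner, or the simplest code possible.

Suppose I have a data frame named df with columns x, y and z: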

x<-c(rep('place1',33),rep('place2',33),rep('place3',34))
y<-sample(c('type1','type2','type3','type4','type5'),100,replace=T)
z<-sample(40:80,100,replace=T)
df<-data.frame(x,y,z)

I want to get all subsets of z for every combination of the levels of x and y (type1 in place1, type2 in place1, type3 in place1 ... type4 in place3 and type5 in place3). Something like this:

[[place1]]
[type1]
[1] 57 73 74 47 52 61

[type2]
[1] 72 76 64 62 73 75
...

[type5]
...

[[place3]]
[type1]
...

[type5]

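If possible, how can I access each subset?

I tried nesting lapply inside split, but without success.

Sorry for the simple question, but I couldn't find a suitable solution.

Any help would be much appreciated.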

Answer 1

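Here is one way. You split df by the variable x. Then you split each of the resulting data frames again by the variable y. This way you can subset the data the way you want. The results shown below are trimmed.

lapply(split(df, f = df$x), function(x) split(x, f = x$y))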

#$place1
#$place1$type1
#        x     y  z
#5  place1 type1 46
#7  place1 type1 41

#$place1$type2
#        x     y  z
#3  place1 type2 44
#4  place1 type2 59

If you only want the values of z, you can do this:

lapply(split(df, f = df$x), function(x) split(x$z, f = x$y))

#$place1
#$place1$type1
#[1] 46 41 50 59 54 51 66 70

#$place1$type2
#[1] 44 59 60 53 74 46 67 70

#$place1$type3
#[1] 63 70 80 44 73 74 58

#$place1$type4
#[1] 45 67 52 72 45 48 79 65

#$place1$type5
#[1] 75 54
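The question also asks how to reach an individual subset. A minimal sketch of accessing the nested list follows; the fixed seed, the `stringsAsFactors = TRUE` setting (the default in older versions of R, chosen here so that empty combinations still appear as empty vectors), and the names `res`, `a`, `b` and `counts` are assumptions made for illustration only.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
# assumption: factor columns, as data.frame() produced by default before R 4.0
df <- data.frame(x, y, z, stringsAsFactors = TRUE)

# Nested list: the first level is x, the second level is y
res <- lapply(split(df, f = df$x), function(d) split(d$z, f = d$y))

# Two equivalent ways to reach one subset
a <- res$place1$type1
b <- res[["place1"]][["type1"]]

# Number of observations in every (y, x) cell, as a 5 x 3 matrix
counts <- sapply(res, lengths)
print(counts)
```

The `[[` form is the one to prefer when the level names are stored in variables, for example `res[[p]][[t]]`.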

EDIT

Looking at the link provided by @user295691, you can also do the following.

split(df$z, interaction(df$x,df$y))
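As a side note, `split()` also accepts a list of grouping vectors and forms their interaction itself, and its `sep` argument controls how the group names are joined. The sketch below assumes the same simulated data as the question, with a fixed seed added for reproducibility; the object names are illustrative.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# Explicit interaction, as in the answer
by_inter <- split(df$z, interaction(df$x, df$y))

# Passing a list of grouping vectors gives the same grouping
by_list <- split(df$z, list(df$x, df$y))
print(identical(by_inter, by_list))

# sep changes the separator used in the group names
by_us <- split(df$z, list(df$x, df$y), sep = "_")
print(names(by_us))
```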

If you want each vector of z values as a separate object in the global environment, you can do this:

list2env(split(df$z, interaction(df$x,df$y)), .GlobalEnv)
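Writing fifteen objects into the global environment can clutter the workspace. One possible variation, sketched here as an assumption rather than as part of the original answer, is to put them into a dedicated environment (called `e` below) and fetch them by name.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

groups <- split(df$z, interaction(df$x, df$y))

# Store each vector as an object in its own environment instead of .GlobalEnv
e <- list2env(groups, envir = new.env())

# Objects are named after the groups, e.g. "place1.type1"
print(ls(e))
one_group <- get("place1.type1", envir = e)
print(one_group)
```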

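EDIT2

The OP wants to run statistics on this data, so I think it is worth leaving the following here. If you need to build a data frame from a list whose vectors have different lengths, you can do something like this. listvectors2df lets you create a data frame padded with NA.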

ana <- split(df$z, interaction(df$x,df$y))

# I used a good answer in this post and wrote the following.
# http://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data

listvectors2df <- function(l){
    n.obs <- sapply(l, length)
    seq.max <- seq_len(max(n.obs))
    mydf <- data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
}

bob <- listvectors2df(ana)
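To see the NA padding on a small input, here is a toy run of the same helper. The function is repeated so the block is self-contained (returning the data frame directly instead of assigning it to `mydf` first), and the list `toy` is a made-up example.

```r
# Same helper as in the answer, repeated here so the block runs on its own
listvectors2df <- function(l) {
  n.obs <- sapply(l, length)
  seq.max <- seq_len(max(n.obs))
  # Indexing past the end of a shorter vector yields NA, which pads the column
  data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
}

# Hypothetical toy input: two vectors of different lengths
toy <- list(a = c(1, 2, 3), b = c(4, 5))
out <- listvectors2df(toy)
print(out)
```

The column `b` has only two values, so its third row is filled with NA.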

Answer 2

You can also use split with interaction:

split(df, interaction(x,y))
$place1.type1
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54

$place2.type1
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79

$place3.type1
        x     y  z
85 place3 type1 54

Accessing each element:

> ll = split(df, interaction(x,y))
> 
> ll[[1]]
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54
> 
> ll[[2]]
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79
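Besides positional indexing, the elements of the flat list can be reached by name, which is often easier to read. A short sketch under the same simulated data (the seed and the names `sub` and `mean_z` are assumptions):

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

ll <- split(df, interaction(df$x, df$y))

# Group names have the form "<x level>.<y level>"
print(names(ll))

# Access one subset by name instead of by position
sub <- ll[["place2.type3"]]
print(sub)

# Apply a function to every subset, here the mean of z
mean_z <- sapply(ll, function(d) mean(d$z))
print(mean_z)
```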

You can also use data.table:

library(data.table)
dtt = data.table(df)

dtt[order(x,y),list(meanz=mean(z), maxz=max(z), sumz=sum(z)),by=list(x,y)]
         x     y    meanz maxz sumz
 1: place1 type1 63.11111   80  568
 2: place1 type2 68.12500   79  545
 3: place1 type3 58.80000   76  294
 4: place1 type4 59.83333   79  359
 5: place1 type5 59.40000   80  297
 6: place2 type1 55.85714   69  391
 7: place2 type2 59.71429   71  418
 8: place2 type3 61.00000   76  305
 9: place2 type4 53.63636   71  590
10: place2 type5 44.66667   46  134
11: place3 type1 62.16667   74  373
12: place3 type2 63.42857   80  444
13: place3 type3 64.00000   77  384
14: place3 type4 61.28571   80  429
15: place3 type5 51.00000   60  408
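If installing a package is not an option, similar per-group summaries can be computed in base R. This is a different technique from the data.table call above, sketched with `aggregate()` and `tapply()`; the seed and object names are assumptions.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# One row per observed (x, y) combination, with the mean of z
mean_by_group <- aggregate(z ~ x + y, data = df, FUN = mean)
mean_by_group <- mean_by_group[order(mean_by_group$x, mean_by_group$y), ]
print(mean_by_group)

# Same idea for the sum of z
sum_by_group <- aggregate(z ~ x + y, data = df, FUN = sum)

# tapply returns a places-by-types matrix instead of a long table
max_table <- tapply(df$z, list(df$x, df$y), max)
print(max_table)
```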

Answer 3

There are several solutions. The first is the lapply / split approach provided by jazzurro. You can also combine the factors into a single factor, e.g.

> split(df, paste(df$x, df$y))
$`place1 type1`
        x     y  z
3  place1 type1 57
24 place1 type1 54

$`place1 type2`
        x     y  z
1  place1 type2 67
6  place1 type2 75
7  place1 type2 72
12 place1 type2 57
...
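One practical difference between `paste` and `interaction` is how combinations that never occur are handled. The toy data frame below is hypothetical and chosen so that one combination ("b" with "v") is missing.

```r
# Hypothetical toy data: the combination x = "b", y = "v" never occurs
toy <- data.frame(x = c("a", "a", "b"),
                  y = c("u", "v", "u"),
                  z = 1:3)

# paste() only produces the combinations that are present, joined by a space
by_paste <- split(toy$z, paste(toy$x, toy$y))
print(names(by_paste))

# interaction() keeps every combination of levels, so "b.v" is an empty group
by_inter <- split(toy$z, interaction(toy$x, toy$y))
print(lengths(by_inter))

# drop = TRUE removes the empty combinations
by_drop <- split(toy$z, interaction(toy$x, toy$y, drop = TRUE))
print(names(by_drop))
```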

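Another solution is to use a library with built-in support for multi-level grouping, such as data.table or plyr / dplyr. In dplyr, the operation looks like this (including a summary, in this case the mean and maximum of the third column):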

> df %>% group_by(x, y) %>% summarise(mean(z), max(z))
Source: local data frame [15 x 4]
Groups: x

        x     y  mean(z) max(z)
1  place1 type1 55.50000     57
2  place1 type2 65.50000     80
3  place1 type3 60.40000     78
4  place1 type4 57.12500     73
...

Subsetting by two factors (all levels), with simple code

3 answers: