Question

假设以下数据：

GroupId <-          c(1,1,1,1,2,2,2,3,3)
IndId <-            c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <-     c(1,2,1,2,2,2,1,2,2)

df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df

这些是多级数据，其中每个组GroupId由一个或多个个人IndId组成，可以访问一个或多个属性IndGroupProperty，这些属性对于他们各自的群组是唯一的（即属性1属于组1而没有其他组）。这些属性均属于PropertyType类型。

任务是用虚拟变量标记每一行，其中至少有一个属于组中每个个体的第1类属性。

对于我们的样本数据，这只是：

ValidGroup <-       c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df

前四行标记为1，因为组（1）的每个个体（1,2）都可以访问类型1属性（1）。后续三行属于组（2），其中只有个人（4）可以访问类型1属性（4）。因此这些都没有标记（0）。最后两行也没有收到任何标志。组（3）仅由一个人（5）组成，可以访问两个第二类属性（5,6）。

我已经研究了几个命令：levels似乎缺乏团队支持; getGroups包中的nlme不喜欢我的真实数据的输入;我想在doBy中可能会有一些有用的内容，但summaryBy似乎不会将levels作为函数。

解决方案编辑：Henrik的dplyr解决方案包含在一个函数中：

foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group)) 
indvar <- deparse(substitute(ind)) 
typevar <- deparse(substitute(type)) 
eval(substitute(
object[, c(groupvar, indvar, typevar)] %.%
  group_by(group, ind) %.%
  mutate(type1 = any(type == 1))  %.%
  group_by(group, add = FALSE) %.%
  mutate(ValidGroup = all(type1) * 1) %.%
  select(-type1)
  ))
}

Answer 1

您也可以尝试ave：

# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))

# within each group, check if all v1 is 1.
# The boolean result is coerced to 1 and 0 by ave.  
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x == 1))

#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0

修改为不同大小的数据集添加了dplyr替代和基准：原始数据，以及比原始数据大10倍和100倍的数据。

首先在功能中包含备选方案：

fun_ave <- function(df){
  v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x == 1))
df  
}

library(dplyr)
fun_dp <- function(df){
df %.%
  group_by(GroupId, IndId) %.%
  mutate(
    type1 = any(PropertyType == 1)) %.%
  group_by(GroupId, add = FALSE) %.%
  mutate(
    ValidGroup = all(type1) * 1) %.%
  select(-type1)
}


fun_by <- function(df){
  bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
      return(FALSE)   # no PropertyType=1 at all in this group
    } else {
      return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
  cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}

基准

原始数据：

microbenchmark(
  fun_ave(df),
  fun_dp(df),
  fun_by(df))

# Unit: microseconds
#        expr      min        lq    median        uq       max neval
# fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#  fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#  fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100

在一个微小的数据集上ave的速度约为dplyr的两倍，比by的速度快2.5倍。

生成一些更大的数据;团体和个人数量的10倍

GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100

dplyr比ave快三倍，比by快近10倍。

团体和个人数量的100倍

GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))

# Unit: milliseconds
# expr        min         lq    median        uq      max neval
# fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#  fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#  fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100

ave现在真的失去了基础。 dplyr比by快近30倍，比ave快100多倍。

Answer 2

试试这个：

bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
        return(FALSE)   # no PropertyType=1 at all in this group
    } else {
        return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
cbind(df,bar[as.character(df$GroupId)])

关键是使用by()通过分组变量应用函数，这里是df$GroupId。要应用的函数是匿名函数。对于每个块（由分组变量定义），它会创建table和IndId条目的PropertyType。然后看看是否＆＃34; 1＆＃34;完全出现在PropertyType中 - 如果没有，则返回FALSE，如果是，则查看每个IndId是否至少有一个＆＃34; 1＆＃34;条目（即table的＆＃34; 1＆＃34;列中的所有条目是否都是＆gt; 0。

我们将by()调用的结果存储在结构bar中，该结构根据分组变量中的级别命名。这反过来允许我们将结果回滚到原始data.frame。请注意我在这里使用as.character()来确保整数被解释为条目名称，而不是条目数字。当事物的名称可以被解释为数字时，经常会发生坏事。

如果您真的想要0-1结果而不是TRUE-FALSE，只需添加as.numeric()。

EDIT。让我们把它变成一个函数。

foobar <- function(object, group, ind, type) { bar <- by(data=object,INDICES=object[,group],FUN=function(xx){ foo <- table(xx[,ind],xx[,type]) if ( !("1" %in% colnames(foo)) ) { return(FALSE) # no PropertyType=1 at all in this group } else { return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry }}) cbind(object,bar[as.character(object[,group])]) } foobar(df,"GroupId","IndId","PropertyType")

这仍然要求目标完全是＆＃34; 1＆＃34;，但当然这也可以作为参数包含在函数定义中。请务必保持列名和包含列名的变量。

标记所有成员在R中满足特定要求的组

2 个答案: