How can we select the elements of a group

Time: 2019-07-03 02:19:41

Tags: r dataframe

Suppose my dataset has 3 columns, as shown below:

    Household     person     activity
        1           1          home
        1           1          school
        1           1          shopping
        1           1          home
        1           2          home
        1           2          work
        1           2          home
        2           1          home
        2           1          work
        2           2          home
        2           2          school
        2           2          home

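The first column is the household ID. The second column is the ID of a person within that household, and the third column is that person's activity.

A person's set of activities is home-based if that person's first and last activity is home.

Is there a way to drop households in which at least one member's activities are not home-based? In the example above, the activities of every member of the first household are home-based, but in the second household the first person's activities are not home-based (home ---> work), so I want to drop the second household.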

2 answers:

Answer 0: (score: 3)

We can first create a temp variable that flags whether the first and last activity of each person is "home", then keep only those Household groups for which all of the temp values are TRUE.

library(dplyr)

df %>%
  group_by(Household, person) %>%
  mutate(temp = first(activity) == "home" & last(activity) == "home") %>%
  group_by(Household) %>%
  filter(all(temp)) %>%
  select(-temp)
#  Household person activity
#      <int>  <int> <fct>
#1         1      1 home
#2         1      1 school
#3         1      1 shopping
#4         1      1 home
#5         1      2 home
#6         1      2 work
#7         1      2 home

Using the same logic, we can also use base R ave.

The same per-household flag also tells us which Household was selected and which was dropped.
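The base R variant can be sketched roughly as follows. This is only an illustration of the same first/last logic with ave, not the answerer's original snippet; the names is_home, person_ok and household_ok are assumptions introduced here.

```r
# Same data as in the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  person    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L),
  activity  = c("home", "school", "shopping", "home", "home", "work",
                "home", "home", "work", "home", "school", "home"),
  stringsAsFactors = FALSE
)

# ave() is documented for numeric x, so the flag is kept as 0/1.
is_home <- as.integer(df$activity == "home")

# 1 on every row of a person whose first and last activity is "home".
person_ok <- ave(is_home, df$Household, df$person,
                 FUN = function(x) x[1] == 1 & x[length(x)] == 1)

# A household is kept only if every one of its persons is flagged.
household_ok <- ave(person_ok, df$Household, FUN = min) == 1
result <- df[household_ok, ]

# One logical value per household: which ones are kept, which dropped.
selected <- tapply(person_ok == 1, df$Household, all)
```

Here result holds the rows of the households that are kept, and selected is a named logical vector with one entry per household.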

答案 1 :(得分:2)

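We can do this with data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Household' and 'person', and create a logical vector that is TRUE when both the first and last values of the 'activity' column are "home". Then group by 'Household' and, if all values in that logical vector are TRUE, keep the Subset of Data.table (.SD).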

library(data.table)
setDT(df)[, ind :=  first(activity) == "home" & last(activity) == "home",
     .(Household, person)][, .SD[all(ind)], Household][, ind := NULL][]
#   Household person activity
#1:         1      1     home
#2:         1      1   school
#3:         1      1 shopping
#4:         1      1     home
#5:         1      2     home
#6:         1      2     work
#7:         1      2     home

If we need 'selected' as a logical summary column:

setDT(df)[, .(selected = all(diff(.SD[, .I[unique(activity[c(1, .N)]) ==
           "home"], person]$V1) == 1)), .(Household)]
#  Household  selected
#1:         1  TRUE
#2:         2 FALSE
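A more explicit way to arrive at the same per-household flag, as a sketch: reduce to one row per person first, then to one row per household. The column name home_based and the intermediate table person_flags are assumptions introduced for illustration, and as.data.table is used so that df itself is not modified by reference.

```r
library(data.table)

# Same data as in the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  person    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L),
  activity  = c("home", "school", "shopping", "home", "home", "work",
                "home", "home", "work", "home", "school", "home"),
  stringsAsFactors = FALSE
)
dt <- as.data.table(df)

# One row per person: is the first and the last activity "home"?
person_flags <- dt[, .(home_based = activity[1] == "home" &
                                    activity[.N] == "home"),
                   by = .(Household, person)]

# One row per household: TRUE only if every person is home-based.
selected <- person_flags[, .(selected = all(home_based)), by = Household]
```

The two-step form trades brevity for readability: each intermediate table can be printed and checked on its own.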

And to get the expected output from the above:

setDT(df)[df[, .I[all(diff(.SD[, .I[unique(activity[c(1, .N)]) == 
         "home"], person]$V1) == 1)], .(Household)]$V1]
#    Household person activity
#1:         1      1     home
#2:         1      1   school
#3:         1      1 shopping
#4:         1      1     home
#5:         1      2     home
#6:         1      2     work
#7:         1      2     home

Data

df <- structure(list(Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L), person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 
2L, 2L, 2L), activity = c("home", "school", "shopping", "home", 
"home", "work", "home", "home", "work", "home", "school", "home"
)), class = "data.frame", row.names = c(NA, -12L))