Suppose my dataset has 3 columns, as shown below:
Household person activity
1 1 home
1 1 school
1 1 shopping
1 1 home
1 2 home
1 2 work
1 2 home
2 1 home
2 1 work
2 2 home
2 2 school
2 2 home
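The first column is the household number, the second column is the person number within that household, and the third column is that person's activity.
A person's set of activities is home-based if that person's first and last activities are both home. Is there a way to drop every household in which at least one member's activities are not home-based? In the example above, all members of the first household have home-based activities, but in the second household the first person's activities are not home-based (home ---> work), so I want to drop the second household.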
Answer 0 (score: 3)
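We can first create a temp variable that flags whether the first and last activity of each person within a Household is "home", and then keep only those households for which all of the temp values are TRUE. The same logic can also be written in base R with ave, and it can be used to report which households were selected or dropped. With dplyr:
library(dplyr)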
df %>%
group_by(Household, person) %>%
mutate(temp = first(activity) == "home" & last(activity) == "home") %>%
group_by(Household) %>%
filter(all(temp)) %>%
select(-temp)
# Household person activity
# <int> <int> <fct>
#1 1 1 home
#2 1 1 school
#3 1 1 shopping
#4 1 1 home
#5 1 2 home
#6 1 2 work
#7 1 2 home
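This answer mentions a base R route through ave, but that code does not appear in the text here. The following is a minimal sketch of how it might look, not the answer author's original code. The helper name home_based and the intermediate names person_ok and household_ok are hypothetical, and the data frame is rebuilt from the data block at the end of the page so the snippet runs on its own.

```r
# Example data, copied from the data block at the end of this page
df <- data.frame(
  Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  person    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L),
  activity  = c("home", "school", "shopping", "home", "home", "work", "home",
                "home", "work", "home", "school", "home"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the first and last activities are both "home"
home_based <- function(x) x[1] == "home" && x[length(x)] == "home"

# One grouping key per (Household, person) pair; drop = TRUE avoids empty groups
grp <- interaction(df$Household, df$person, drop = TRUE)

# Per-person flag, repeated on every row belonging to that person.
# ave() works on row indices here so the activity column is not overwritten.
person_ok <- as.logical(
  ave(seq_len(nrow(df)), grp, FUN = function(i) home_based(df$activity[i]))
)

# Per-household flag: TRUE only if every row (hence every person) is flagged
household_ok <- ave(person_ok, df$Household, FUN = all)

# Keep only the rows of fully home-based households
result <- df[household_ok, ]
```

On the example data, result holds only the rows of household 1, because person 1 of household 2 ends the day at work.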
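Answer 1 (score: 2)
We can do this with data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Household' and 'person', and create a logical vector that is TRUE when both the first and the last value of the 'activity' column are "home". Then, grouped by 'Household', take the Subset of Data.table (.SD) only if all values in that logical vector are TRUE.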
library(data.table)
setDT(df)[, ind := first(activity) == "home" & last(activity) == "home",
.(Household, person)][, .SD[all(ind)], Household][, ind := NULL][]
# Household person activity
#1: 1 1 home
#2: 1 1 school
#3: 1 1 shopping
#4: 1 1 home
#5: 1 2 home
#6: 1 2 work
#7: 1 2 home
If we need 'selected' as a logical summary column, one row per household:
setDT(df)[, .(selected = all(diff(.SD[, .I[unique(activity[c(1, .N)]) ==
"home"], person]$V1) == 1)), .(Household)]
# Household selected
#1: 1 TRUE
#2: 2 FALSE
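The one-liner above is compact but hard to read, since it relies on row indices and diff. As a hedged alternative, the same per-household flag can be built in two explicit grouping steps. This is a sketch under the assumption that the rows of each person are already in chronological order; the names dt, per_person, per_household, and home_based are hypothetical.

```r
library(data.table)

# Example data, copied from the data block at the end of this page
dt <- data.table(
  Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  person    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L),
  activity  = c("home", "school", "shopping", "home", "home", "work", "home",
                "home", "work", "home", "school", "home")
)

# Step 1: one row per person, TRUE when first and last activity are "home"
per_person <- dt[, .(home_based = activity[1L] == "home" & activity[.N] == "home"),
                 by = .(Household, person)]

# Step 2: one row per household, TRUE only when every member is home-based
per_household <- per_person[, .(selected = all(home_based)), by = Household]

# Optional: keep only the rows of the selected households
kept <- dt[Household %in% per_household[selected == TRUE, Household]]
```

Splitting the work this way keeps the intermediate per-person table available for inspection, which helps when checking why a particular household was dropped.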
And to get the expected output using the same condition:
setDT(df)[df[, .I[all(diff(.SD[, .I[unique(activity[c(1, .N)]) ==
"home"], person]$V1) == 1)], .(Household)]$V1]
# Household person activity
#1: 1 1 home
#2: 1 1 school
#3: 1 1 shopping
#4: 1 1 home
#5: 1 2 home
#6: 1 2 work
#7: 1 2 home
Data used:
df <- structure(list(Household = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 2L), activity = c("home", "school", "shopping", "home",
"home", "work", "home", "home", "work", "home", "school", "home"
)), class = "data.frame", row.names = c(NA, -12L))