I have two datasets. One lists the products each user has, like this:
User Product
A . 1
A . 2
A . 3
B . 1
B . 3
B . 4
and another table that lists the products in each group:
Group Product
X1 . 1
X1 . 2
X1 . 4
X2 . 1
X2 . 3
My requirement is that a user belongs to a group if the user has all of the products in that group. The result should look like this:
User X1 X2
A . 1 0
B . 0 1
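I have tried hand-written loops and matching with a custom function, but my real data is large and neither solution works well.
Any help is appreciated.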
Answer 0 (score: 3)
You can do this with some tidy code.
First, some data without the dots (I removed the dots as unnecessary; correct me if I am wrong):
x1 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
User Product
A 1
A 2
A 3
B 1
B 3
B 4')
x2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Group Product
X1 1
X1 2
X1 4
X2 1
X2 3')
out <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
User X1 X2
A 1 0
B 0 1')
Required packages:
library(dplyr)
library(tidyr)
library(purrr)
x1n <- group_by(x1, User) %>% nest(.key = "x1prod")
x2n <- group_by(x2, Group) %>% nest(.key = "x2prod")
crossing(User = x1n$User, Group = x2n$Group) %>%
  left_join(x1n, by = "User") %>%
  left_join(x2n, by = "Group") %>%
  mutate(allx = map2_lgl(x1prod, x2prod, ~ all(.y$Product %in% .x$Product)))
# # A tibble: 4 x 5
# User Group x1prod x2prod allx
# <chr> <chr> <list> <list> <lgl>
# 1 A X1 <tibble [3 x 1]> <tibble [3 x 1]> FALSE
# 2 A X2 <tibble [3 x 1]> <tibble [2 x 1]> TRUE
# 3 B X1 <tibble [3 x 1]> <tibble [3 x 1]> FALSE
# 4 B X2 <tibble [3 x 1]> <tibble [2 x 1]> TRUE
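This is of course not the output you want, but I am showing it to demonstrate what the nesting does, and that we are comparing x1prod (a single column, Product) with x2prod (same) row by row. From here, it is just a matter of dropping the list columns and spreading: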
crossing(User = x1n$User, Group = x2n$Group) %>%
  left_join(x1n, by = "User") %>%
  left_join(x2n, by = "Group") %>%
  mutate(allx = map2_lgl(x1prod, x2prod, ~ all(.y$Product %in% .x$Product))) %>%
  select(-x1prod, -x2prod) %>%
  spread(Group, allx)
# # A tibble: 2 x 3
# User X1 X2
# <chr> <lgl> <lgl>
# 1 A FALSE TRUE
# 2 B FALSE TRUE
(I am also assuming that your expected output is wrong, since user A does not have product 4, which group X1 requires.)
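The question asked for 0/1 values rather than TRUE/FALSE. As a rough illustration only, the same "all of the group's products are among the user's products" test can be sketched in base R so that it fills a 0/1 matrix directly. The names user_products, group_products and membership are made up for this sketch, and the data is retyped from the question.

```r
# Data from the question, without the dots
x1 <- data.frame(User = c("A", "A", "A", "B", "B", "B"),
                 Product = c(1, 2, 3, 1, 3, 4),
                 stringsAsFactors = FALSE)
x2 <- data.frame(Group = c("X1", "X1", "X1", "X2", "X2"),
                 Product = c(1, 2, 4, 1, 3),
                 stringsAsFactors = FALSE)

# One vector of products per user and per group
user_products  <- split(x1$Product, x1$User)
group_products <- split(x2$Product, x2$Group)

# Start with an all-zero User x Group matrix
membership <- matrix(0L,
                     nrow = length(user_products),
                     ncol = length(group_products),
                     dimnames = list(names(user_products), names(group_products)))

# Set a cell to 1 when every product of the group is among the user's products
for (u in names(user_products)) {
  for (g in names(group_products)) {
    membership[u, g] <- as.integer(all(group_products[[g]] %in% user_products[[u]]))
  }
}
membership
```

With the sample data, both A and B come out as members of X2 only, which matches the TRUE/FALSE table above.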
Answer 1 (score: 0)
Another answer, using only dplyr and a loop:
library(dplyr)
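# df1 (User, Product) and df2 (Group, Product) are assumed to be plain
# data.frames holding the two tables from the question.
myFunction = function(df1, df2, user, group, product){
  # Turn the unquoted column names into strings
  user = deparse(substitute(user))
  product = deparse(substitute(product))
  group = deparse(substitute(group))
  # Read the label from the column itself, so a factor gives "A" rather than its integer code
  answer = data.frame(User = as.character(df1[[user]][1]))
  for(i in unique(df2[,group])){
    # 1 if every product of group i is among this user's products, otherwise 0
    temp = df1 %>% summarise(!!i := if_else(all(df2[which(df2[,group] == i),][,product] %in% unique(df1[[product]])), 1, 0))
    answer = cbind(answer, temp[,i])
  }
  return(answer)
}
df1 %>% group_by(User) %>% do(myFunction(., df2, User, Group, Product))
# # A tibble: 2 x 3
# # Groups: User [2]
# User X1 X2
# <chr> <dbl> <dbl>
# 1 A 0 1
# 2 B 0 1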
Answer 2 (score: 0)
Here is a solution using only dplyr and tidyr:
library(dplyr)
library(tidyr)
user_product <- data.frame(User = rep(LETTERS[1:2], each = 3), Product = c(1:3, 1, 3, 4))
group_product <- data.frame(Group = c("x1", "x1", "x1", "x2", "x2"), Product = c(1,2,4,1,3))
left_join(user_product, group_product, by = "Product") %>%
  left_join(group_product, by = "Group") %>%
  group_by(User, Group) %>%
  summarize(
    test = all(Product.y %in% Product.x)
  ) %>%
  spread(Group, test)
# # A tibble: 2 x 3
# # Groups: User [2]
# User x1 x2
# <fct> <lgl> <lgl>
# 1 A FALSE TRUE
# 2 B FALSE TRUE
This is somewhat similar to what @r2evans has already shared, but much more verbose, easier to understand, and with fewer package dependencies.
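Since the question mentions that the real data is large, a count-based variant may be worth trying. This is only a sketch in base R, not something taken from the answers above: it joins the two tables on Product, counts how many of each group's products every user has, and compares that count with the group's size. It assumes that neither table contains duplicated rows, and the names group_size, matched, counts and membership are made up for illustration.

```r
# Data from the question, without the dots
x1 <- data.frame(User = c("A", "A", "A", "B", "B", "B"),
                 Product = c(1, 2, 3, 1, 3, 4),
                 stringsAsFactors = FALSE)
x2 <- data.frame(Group = c("X1", "X1", "X1", "X2", "X2"),
                 Product = c(1, 2, 4, 1, 3),
                 stringsAsFactors = FALSE)

# Assumption: no duplicated (User, Product) or (Group, Product) rows
users  <- unique(x1$User)
# Number of products each group requires, as a named integer vector
group_size <- lengths(split(x2$Product, x2$Group))
groups <- names(group_size)

# One row for every product a user has that also appears in a group
matched <- merge(x1, x2, by = "Product")

# Count the matched products per user and group
counts <- aggregate(Product ~ User + Group, data = matched, FUN = length)
names(counts)[names(counts) == "Product"] <- "n_matched"

# Keep the pairs where the user has every product the group requires
full <- counts[counts$n_matched == group_size[as.character(counts$Group)], ]

# Fill a 0/1 User x Group matrix; pairs without a full match stay 0
membership <- matrix(0L, nrow = length(users), ncol = length(groups),
                     dimnames = list(users, groups))
membership[cbind(as.character(full$User), as.character(full$Group))] <- 1L
membership
```

The join and the counting are vectorised, so there is no explicit loop over every user and group pair. If duplicated rows are possible, deduplicating both tables first (for example with unique()) keeps the counts comparable.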