Match group data against user data and get the groups

Time: 2018-09-26 02:33:25

Tags: r dplyr

I have two datasets. One lists users and their products, like this:

User Product
A .   1
A .   2
A .   3
B .   1
B .   3
B .   4

and another table:

Group Product
X1 .   1
X1 .   2
X1 .   4
X2 .   1
X2 .   3

My requirement: if a user has all of the products in a group, that user belongs to the group. The result should look like this:

User X1 X2
A .   1  0
B .   0 .1

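I tried writing loops by hand and matching with a custom function, but my actual data is large and neither solution works well.

Any help is appreciated.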

3 answers:

Answer 0: (score: 3)

You can do this with some tidy code.

First, some dot-free data (I removed the unnecessary dots, correct me if I'm wrong):

x1 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
User Product
A    1
A    2
A    3
B    1
B    3
B    4')
x2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Group Product
X1    1
X1    2
X1    4
X2    1
X2    3')
out <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
User X1 X2
A    1  0
B    0  1')

Required packages:

library(dplyr)
library(tidyr)
library(purrr)

x1n <- group_by(x1, User) %>% nest(.key = "x1prod")
x2n <- group_by(x2, Group) %>% nest(.key = "x2prod")

crossing(User = x1n$User, Group = x2n$Group) %>%
  left_join(x1n, by = "User") %>%
  left_join(x2n, by = "Group") %>%
  mutate(allx = map2_lgl(x1prod, x2prod, ~ all(.y$Product %in% .x$Product)))
# # A tibble: 4 x 5
#   User  Group x1prod           x2prod           allx 
#   <chr> <chr> <list>           <list>           <lgl>
# 1 A     X1    <tibble [3 x 1]> <tibble [3 x 1]> FALSE
# 2 A     X2    <tibble [3 x 1]> <tibble [2 x 1]> TRUE 
# 3 B     X1    <tibble [3 x 1]> <tibble [3 x 1]> FALSE
# 4 B     X2    <tibble [3 x 1]> <tibble [2 x 1]> TRUE 

This is certainly not the output you want, but I'm showing it to demonstrate what the nesting does: row by row, we compare x1prod (a single-column frame of Product) with x2prod (same). From here, just drop the list columns and spread:

crossing(User = x1n$User, Group = x2n$Group) %>%
  left_join(x1n, by = "User") %>%
  left_join(x2n, by = "Group") %>%
  mutate(allx = map2_lgl(x1prod, x2prod, ~ all(.y$Product %in% .x$Product))) %>%
  select(-x1prod, -x2prod) %>%
  spread(Group, allx)
# # A tibble: 2 x 3
#   User  X1    X2   
#   <chr> <lgl> <lgl>
# 1 A     FALSE TRUE 
# 2 B     FALSE TRUE 

(I'm also assuming your expected output is wrong, since user A has no product "4", which group X1 requires.)
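The all(... %in% ...) test used above is a plain subset check, so it can also be sketched in base R without nesting. This is a minimal sketch, not the answer author's method: it rebuilds the same sample data so it runs on its own, and the names user_products, group_products, membership and membership_01 are made up for illustration.

```r
# Same sample data as x1 and x2 above, redefined so the sketch is self-contained.
x1 <- data.frame(User = c("A", "A", "A", "B", "B", "B"),
                 Product = c(1, 2, 3, 1, 3, 4),
                 stringsAsFactors = FALSE)
x2 <- data.frame(Group = c("X1", "X1", "X1", "X2", "X2"),
                 Product = c(1, 2, 4, 1, 3),
                 stringsAsFactors = FALSE)

# One vector of products per user and per group (hypothetical helper names).
user_products  <- split(x1$Product, x1$User)
group_products <- split(x2$Product, x2$Group)

# membership[u, g] is TRUE when every product of group g is among user u's products.
# The outer sapply builds one column per group, the inner one a row per user.
membership <- sapply(group_products, function(gp) {
  sapply(user_products, function(up) all(gp %in% up))
})
membership
#      X1   X2
# A FALSE TRUE
# B FALSE TRUE

# Multiply by 1L to get the 0/1 layout the question asks for.
membership_01 <- membership * 1L
```

With only one user, sapply would simplify to a vector rather than a matrix, so real code would need to handle that case. The pairwise loop also does users × groups comparisons, which may be slow on large data.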

Answer 1: (score: 0)

Another answer, using only dplyr and a loop:

library(dplyr)
myFunction = function(df1, df2, user, group, product){
  user = deparse(substitute(user))
  product = deparse(substitute(product))
  group = deparse(substitute(group))
  answer = data.frame(User = as.character(df1[1, user]))
  for(i in unique(df2[,group])){
    temp = df1 %>% summarise(!!i := if_else(all(df2[which(df2[,group] == i),][,product] %in% unique(df1[[product]])), 1, 0))
    answer = cbind(answer, temp[,i])
  }
  return(answer)
}

df1 %>% group_by(User) %>% do(myFunction(., df2, User, Group, Product))
df1

# A tibble: 2 x 3
# Groups:   User [2]
  User     X1    X2
  <chr> <dbl> <dbl>
1 1         0     1
2 2         0     1

Answer 2: (score: 0)

Here is a solution using only dplyr and tidyr:

library(dplyr)
library(tidyr)

user_product <- data.frame(User = rep(LETTERS[1:2], each = 3), Product = c(1:3, 1, 3, 4))
group_product <- data.frame(Group = c("x1", "x1", "x1", "x2", "x2"), Product = c(1,2,4,1,3))

left_join(user_product, group_product, by = "Product") %>%
  left_join(group_product, by = "Group") %>%
  group_by(User, Group) %>%
  summarize(
    test = all(Product.y %in% Product.x)
  ) %>%
  spread(Group, test)

# A tibble: 2 x 3
# Groups:   User [2]
  User  x1    x2   
  <fct> <lgl> <lgl>
1 A     FALSE TRUE 
2 B     FALSE TRUE

This is somewhat similar to what @r2evans already shared, but considerably more verbose, easier to follow, and with fewer package dependencies.
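If package dependencies are a concern, the same membership test can be phrased as a count comparison in base R: a user belongs to a group when the number of that group's products the user holds equals the group's size. The following is only a sketch under the assumption that neither table contains duplicate rows. The names matched, needed, hits and member are illustrative and do not come from the answers above.

```r
user_product <- data.frame(User = rep(c("A", "B"), each = 3),
                           Product = c(1, 2, 3, 1, 3, 4))
group_product <- data.frame(Group = c("x1", "x1", "x1", "x2", "x2"),
                            Product = c(1, 2, 4, 1, 3))

# Inner join: one row per (User, Group) pair for each product they share.
matched <- merge(user_product, group_product, by = "Product")

# Number of products each group requires (named integer vector).
needed <- lengths(split(group_product$Product, group_product$Group))

# Count shared products per user and group. Explicit factor levels keep
# users or groups with zero matches in the table.
users <- unique(as.character(user_product$User))
hits <- unclass(table(factor(matched$User, levels = users),
                      factor(matched$Group, levels = names(needed))))

# Transpose so groups run down the rows, compare each column with `needed`,
# then transpose back to users x groups.
member <- t(t(hits) == needed)
member
#        x1   x2
# A   FALSE TRUE
# B   FALSE TRUE
```

Counting after a single join avoids comparing every user with every group explicitly, which may help on larger data, though that has not been measured here.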