Question

I'm having trouble summarizing counts that depend on conditions across fields, using aggregation functions.

Example:

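library(dplyr)

# All three columns are character strings; from == "0" marks a project not derived from another one.
df = as_tibble(data.frame(
    users=c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"), 
    projects=c("100", "101", "102", "103", "104", "105", "106", "107", "108", "109", "110", "111", "112"), 
    from=c("0", "0", "111", "106", "111", "101", "0", "101", "0", "100", "106", "108", "0"),
    stringsAsFactors = FALSE))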

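The table contains users (users), the projects each user owns (projects), and the project of another user that each project is derived from (from).

I would like to know which users have established the most relationships with other users through the use of projects. As the table shows, a user's projects can be used by other users (from), and a user can have their own projects (projects).

I thought about counting the relationships in two ways: the number of a user's projects that are used by other users, and the number of projects the user uses without being their owner.

Could someone give me a hint using ddply or other functions (such as summarize or group_by)?

I was able to write a function using for loops, but I know that is not the most appropriate solution, especially since I have millions of users to process.

Thanks in advance!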

Answer 1

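# Per user: (1) rows anywhere in df whose `from` is one of this user's projects,
#           (2) distinct projects of other users that this user derives from.
# df$from is deliberately the whole column; bare `projects` and `from` are per group.
out <- data.frame(summarize(group_by(df, users),
                     number_of_user_owned_projects = length(df$from[df$from %in% projects]),
                     number_of_projects_from_others = length(unique(from[from != "0"]))))
out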
  users number_of_user_owned_projects number_of_projects_from_others
1     1                             3                              2
2     2                             2                              2
3     3                             1                              1
4     4                             2                              3
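Since the question mentions millions of users, note that `df$from %in% projects` rescans the whole `from` column once per group. A minimal base R sketch of a single-pass alternative follows; it looks up the owner of each source project with `match()` and then tabulates. The names `source_owner`, `used_by_others` and `from_others` are introduced here for illustration only, and the sketch assumes every project ID appears exactly once in `projects`, as it does in the example data.

```r
# Same toy data as in the question, as plain character vectors.
users    <- c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4")
projects <- c("100", "101", "102", "103", "104", "105", "106",
              "107", "108", "109", "110", "111", "112")
from     <- c("0", "0", "111", "106", "111", "101", "0",
              "101", "0", "100", "106", "108", "0")

# Owner of the project each row derives from; NA where from is "0" (no match).
# Assumption: project IDs are unique, so match() finds the single owning row.
source_owner <- users[match(from, projects)]

# How often each user's projects are used as a source (NA entries are ignored by table()).
used_by_others <- table(factor(source_owner, levels = unique(users)))

# Distinct source projects each user derives from.
from_others <- tapply(from, users, function(f) length(unique(f[f != "0"])))

used_by_others
from_others
```

On the example data both vectors agree with the table above (3, 2, 1, 2 and 2, 2, 1, 3 for users 1 to 4).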

Answer 2

# How many times each project is used as a source; rows with from == "0" are dropped.
temp = df %>% group_by(from) %>% summarise(cntr = n()) %>% filter(from != "0")

#temp

#    from  cntr
#1    100     1
#2    101     2
#3    106     2
#4    108     1
#5    111     2


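# Attach the usage count to each owned project, then total per user.
# Projects never used as a source get cntr = NA, hence na.rm = TRUE.
output = left_join(df, temp, by = c("projects" = "from")) %>% 
             group_by(users) %>% 
             summarize(user_owned = sum(cntr, na.rm = TRUE), other_owned = sum(from != "0"))

#output

#   users user_owned other_owned
#1      1          3           2
#2      2          2           2
#3      3          1           1
#4      4          2           3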
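The two answers agree on this data, but they count the second column differently: Answer 1 counts distinct source projects, while Answer 2 counts deriving rows. A small base R sketch shows where they would diverge. The `from` vector below is a hypothetical case that is not in the original data, in which one user derives two projects from the same source.

```r
# Hypothetical `from` values for a single user (not taken from the question's data):
# two of the user's projects derive from project "101".
from <- c("101", "101", "0", "106")

# Answer 2's style: one count per deriving row.
n_rows <- sum(from != "0")

# Answer 1's style: one count per distinct source project.
n_distinct <- length(unique(from[from != "0"]))

n_rows      # 3
n_distinct  # 2
```

Which of the two is appropriate depends on whether repeated use of the same project should count as one relationship or several.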

How can I create group and count variables to find relationships between variables?
