Create a new table with a calculated field from data in two different tables

Time: 2019-07-23 02:29:02

Tags: r

I have two data tables:

The first table shows the mentors' names and all the students assigned to each mentor.

    mentor          student_name
    Dr. Brown       Michael
    Dr. Brown       Diana
    Dr. Brown       Peter
    Dr. Brown       Christopher
    Dr. Brown       Stacy
    Ms. Lindblom    Rose
    Ms. Lindblom    Anne
    Ms. Lindblom    Steven
    Ms. Lindblom    Gloria
    Mr. Apple       Juan
    Mr. Apple       Francis
    Mr. Apple       David
    Mr. Apple       Sonja
    Mr. Apple       Dakota
    Mr. Apple       Latoya
    Mr. Apple       Avril
    Mr. Apple       James
    Mr. Apple       Stewart
    Mr. Apple       Sophia

The second table shows the one-on-one tutoring sessions held between mentors and students.

    mentor         date_of_tutoring    student_name
    Dr. Brown      07/14/2019          Peter
    Dr. Brown      07/15/2019          Christopher
    Ms. Lindblom   06/28/2019          Gloria
    Mr. Apple      06/20/2019          Sophia
    Mr. Apple      06/22/2019          Latoya
    Mr. Apple      06/25/2019          Juan
    Mr. Apple      06/26/2019          Avril

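Each mentor needs to hold one mentoring session with each of their students during the school year.

I want to create a new table showing the percentage of the mentoring task each mentor has completed. A mentor's task is complete (100%) once they have held a one-on-one mentoring session with every one of their students.

For example, based on the data in table 2 and the number of students assigned to each mentor, I want to create a new table like this: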

    teacher           %_mentoring_completed
    Dr. Brown          40%
    Ms. Lindblom       25%
    Mr. Apple          40%

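2 answers:

Answer 0 (score: 1)

One option is to left-join the two datasets, group by 'mentor', and take the mean of the logical vector marking the non-NA tutoring dates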

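library(dplyr)
library(stringr)
# Keep every assigned student; date_of_tutoring is NA where no session took place
left_join(df1, df2, by = c("mentor", "student_name")) %>% 
   group_by(mentor) %>% 
    summarise(PercentageMentoringCompleted = str_c(100 * 
             mean(!is.na(date_of_tutoring)), "%"))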
# A tibble: 3 x 2
#  mentor       PercentageMentoringCompleted
#  <chr>        <chr>                       
#1 Dr. Brown    40%                         
#2 Mr. Apple    40%                         
#3 Ms. Lindblom 25%   
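
To see why the mean of a logical vector gives the completion rate, here is a minimal sketch using only Dr. Brown's five rows as they appear after the join (the variable names `dates` and `completed` are illustrative, not part of the answer):

```r
# date_of_tutoring for Michael, Diana, Peter, Christopher, Stacy after the left join
dates <- c(NA, NA, "07/14/2019", "07/15/2019", NA)

# TRUE where a session exists; TRUE counts as 1 and FALSE as 0 in mean()
completed <- !is.na(dates)

# Share of assigned students with a session, formatted as a percentage string
pct <- paste0(100 * mean(completed), "%")
pct
```

Two of the five students have a date, so the mean is 2/5, which matches the 40% shown for Dr. Brown.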

Or another option is to use count

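library(dplyr)
library(purrr)
# Count rows per mentor in each table, then join the two counts:
# n.x = sessions held (df2), n.y = students assigned (df1)
list(df2, df1) %>% map(~ .x %>% 
           dplyr::count(mentor)) %>% 
           reduce(inner_join, by = 'mentor') %>%
           transmute(mentor, perc = 100 * n.x/n.y)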
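
One thing to watch with the count approach: because of the inner join, a mentor who has not held any session yet would disappear from the result instead of showing 0. The following is a rough sketch of one way around that, assuming each session row corresponds to a distinct assigned student; the toy data and the names `assigned`, `sessions`, `n_assigned` and `n_done` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: mentor B has one student and no sessions yet
assigned <- data.frame(mentor = c("A", "A", "B"),
                       student_name = c("s1", "s2", "s3"))
sessions <- data.frame(mentor = "A", student_name = "s1")

res <- count(assigned, mentor, name = "n_assigned") %>%
  # left_join keeps every mentor from the assignment table
  left_join(count(sessions, mentor, name = "n_done"), by = "mentor") %>%
  # mentors without sessions get NA from the join; treat that as zero
  mutate(n_done = coalesce(n_done, 0L),
         perc = 100 * n_done / n_assigned)
res
```

Starting from the assignment table rather than the session table is what guarantees one output row per mentor.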

Or, in base R, use merge and aggregate (this returns proportions between 0 and 1 rather than percentage strings)

aggregate(PercentageMentoringCompleted ~ mentor,
  transform(merge(df1, df2, all.x = TRUE), 
       PercentageMentoringCompleted = !is.na(date_of_tutoring)), mean) 
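
If the base R result should look like the requested table, the proportion can be turned into a percentage string afterwards. This is only a sketch on made-up data; `assigned`, `sessions`, `merged` and `done` are illustrative names.

```r
# Hypothetical toy data: mentor A has seen both students, mentor B one of two
assigned <- data.frame(mentor = c("A", "A", "B", "B"),
                       student_name = c("s1", "s2", "s3", "s4"))
sessions <- data.frame(mentor = c("A", "A", "B"),
                       date_of_tutoring = c("07/01/2019", "07/02/2019", "07/03/2019"),
                       student_name = c("s1", "s2", "s3"))

# all.x = TRUE keeps assigned students without a session (date becomes NA)
merged <- merge(assigned, sessions, all.x = TRUE)
merged$done <- !is.na(merged$date_of_tutoring)

# Proportion of completed sessions per mentor, then formatted as a percentage
res <- aggregate(done ~ mentor, merged, mean)
res$done <- paste0(100 * res$done, "%")
res
```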

Data

df1 <- structure(list(mentor = c("Dr. Brown", "Dr. Brown", "Dr. Brown", 
"Dr. Brown", "Dr. Brown", "Ms. Lindblom", "Ms. Lindblom", "Ms. Lindblom", 
"Ms. Lindblom", "Mr. Apple", "Mr. Apple", "Mr. Apple", "Mr. Apple", 
"Mr. Apple", "Mr. Apple", "Mr. Apple", "Mr. Apple", "Mr. Apple", 
"Mr. Apple"), student_name = c("Michael", "Diana", "Peter", "Christopher", 
"Stacy", "Rose", "Anne", "Steven", "Gloria", "Juan", "Francis", 
"David", "Sonja", "Dakota", "Latoya", "Avril", "James", "Stewart", 
"Sophia")), class = "data.frame", row.names = c(NA, -19L))

df2 <- structure(list(mentor = c("Dr. Brown", "Dr. Brown", "Ms. Lindblom", 
"Mr. Apple", "Mr. Apple", "Mr. Apple", "Mr. Apple"), 
 date_of_tutoring = c("07/14/2019", 
"07/15/2019", "06/28/2019", "06/20/2019", "06/22/2019", "06/25/2019", 
"06/26/2019"), student_name = c("Peter", "Christopher", "Gloria", 
"Sophia", "Latoya", "Juan", "Avril")), class = "data.frame", row.names = c(NA, 
-7L))

Answer 1 (score: 0)

We can use table to count how often each mentor appears, assuming the same set of unique mentors is present in both data frames.

stack(table(df2$mentor)/table(df1$mentor))

#  values          ind
#1   0.40    Dr. Brown
#2   0.40    Mr. Apple
#3   0.25 Ms. Lindblom

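If they are not the same, or they appear in a different order, a safer option is to use factor with explicit levels so that the two tables line up in the same order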

stack(table(factor(df2$mentor, levels = unique(df1$mentor)))/
      table(factor(df1$mentor, levels = unique(df1$mentor))))
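
The explicit levels matter most when some mentor has no sessions at all: dividing two plain tables pairs the counts by position, so the tables must have the same length and order. A small sketch on made-up data (the names `assigned`, `sessions`, `lv`, `done`, `total` and `prop` are illustrative):

```r
# Hypothetical toy data: mentor C has one student and no sessions
assigned <- data.frame(mentor = c("A", "A", "B", "B", "C"))
sessions <- data.frame(mentor = c("A", "B", "B"))

# Shared levels force both tables to the same length and order,
# with a zero count for mentors missing from the session table
lv <- unique(assigned$mentor)
done  <- table(factor(sessions$mentor, levels = lv))
total <- table(factor(assigned$mentor, levels = lv))

ratio <- done / total
res <- data.frame(mentor = names(ratio), prop = as.vector(ratio))
res
```

Here mentor C ends up with a proportion of 0 instead of being dropped or misaligned.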