Question

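We have student-level data, where students are found in particular grades and subjects, within teachers, within schools, within districts.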

student grade  subject  teacher  school  district female  poverty
1        4       Math     1        1        1       Yes      No
2        4       Math     1        1        1       Yes      No
3        4       Math     1        1        1        No      No
4        4       Math     2        1        1       Yes     Yes
5        4       Math     2        1        1       Yes     Yes
6        4       Math     3        1        1       Yes      No
7        4       Math     4        1        1        No     Yes
8        4       Math     5        1        1        No     Yes
9        4       Math     5        1        1        No     Yes

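The data comprise more than 700,000 rows for any given year and subject, covering multiple grades and teachers across multiple districts and schools.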

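For each unique teacher within each grade + subject + school + district, we need to

(a) add columns indicating what percentage of the students in his/her classes are female, in poverty, etc., and
(b) collapse the resulting data frame into one with a single row per unique teacher within each grade + subject + school + district.

The resulting df would look something like this: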

teacher  grade  subject  district  school  pct_fem  pct_poor  ...
1         4       Math     1           1     66.66     0      ...
2         4       Math     1           1    100.00     66.66  ...

and so on.

We have been doing this with plyr, e.g.

ddply(df, .(teacher, grade, subject, district, school), transform, 
  n_students=length(unique(student)),
  n_fem=length(unique(student[female=="Yes"])), 
  pct_fem = (n_fem/n_students)*100)

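However, this seems to take forever, and it frequently produces error messages saying the code cannot find n_fem or n_students.

If we write multiple ddply() statements that each generate only one column, it works fine, but that seems clearly inefficient: we then have to merge the new columns into a new data frame and collapse that data frame to a single record per teacher within each grade, subject, district, and school.

What is the most efficient way to get what we want with datasets this large? Any tips would be much appreciated.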

Answer 1

You can define a function to apply to each column within each group:

# Proportion of "Yes" values in x. The fixed factor levels keep "Yes" in the
# table even when a group has none, so such a group returns 0.
prop_Yes = function(x){
  tab = prop.table(table(factor(x,levels=c("Yes","No"))))
  tab[names(tab)=="Yes"]
}

g_vars = c("grade", "subject", "teacher", "school", "district")  # grouping columns
p_vars = c("female", "poverty")                                  # Yes/No columns to summarize

There are a few different ways to go from here:

Base R

aggregate(DF[p_vars], DF[g_vars], prop_Yes)

  grade subject teacher school district    female poverty
1     4    Math       1      1        1 0.6666667       0
2     4    Math       2      1        1 1.0000000       1
3     4    Math       3      1        1 1.0000000       0
4     4    Math       4      1        1 0.0000000       1
5     4    Math       5      1        1 0.0000000       1
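
The question asked for percentages and a student count rather than proportions. A minimal base R sketch of that follows. It assumes the sample rows from the question are typed in by hand as DF, and it uses a hypothetical helper, pct_yes, which is not part of the answer and stands in for prop_Yes:

```r
# Sample rows from the question, rebuilt by hand (an assumption for illustration)
DF <- data.frame(
  student  = 1:9,
  grade    = 4,
  subject  = "Math",
  teacher  = c(1, 1, 1, 2, 2, 3, 4, 5, 5),
  school   = 1,
  district = 1,
  female   = c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No"),
  poverty  = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes")
)

g_vars <- c("grade", "subject", "teacher", "school", "district")
p_vars <- c("female", "poverty")

# Hypothetical helper: percentage of "Yes" values.
# mean() of a logical vector is the share of TRUEs.
pct_yes <- function(x) 100 * mean(x == "Yes")

# One row per teacher group, with the Yes/No columns turned into percentages
pct <- aggregate(DF[p_vars], DF[g_vars], pct_yes)
names(pct)[names(pct) %in% p_vars] <- c("pct_fem", "pct_poor")

# Number of distinct students per teacher group
n <- aggregate(DF["student"], DF[g_vars], function(s) length(unique(s)))
names(n)[names(n) == "student"] <- "n_students"

res <- merge(pct, n, by = g_vars)
print(res)
```

Note that pct_yes returns NA for any group containing a missing value; whether to drop NAs first depends on how the real data should treat them.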

data.table

library(data.table)    
setDT(DF)[ , lapply(.SD, prop_Yes), by=g_vars, .SDcols=p_vars]

   grade subject teacher school district    female poverty
1:     4    Math       1      1        1 0.6666667       0
2:     4    Math       2      1        1 1.0000000       1
3:     4    Math       3      1        1 1.0000000       0
4:     4    Math       4      1        1 0.0000000       1
5:     4    Math       5      1        1 0.0000000       1
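
The same grouped summary can probably be written with dplyr as well. The sketch below is an assumption, not code from the answer: it requires dplyr 1.0.0 or later for across(), rebuilds the sample rows by hand, and uses a plain numeric stand-in (prop_yes) for prop_Yes.

```r
library(dplyr)

# Sample rows from the question, rebuilt by hand (an assumption for illustration)
DF <- data.frame(
  student  = 1:9,
  grade    = 4,
  subject  = "Math",
  teacher  = c(1, 1, 1, 2, 2, 3, 4, 5, 5),
  school   = 1,
  district = 1,
  female   = c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No"),
  poverty  = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes")
)

g_vars <- c("grade", "subject", "teacher", "school", "district")
p_vars <- c("female", "poverty")

# Hypothetical stand-in for prop_Yes that returns a bare numeric proportion
prop_yes <- function(x) mean(x == "Yes")

out <- DF %>%
  group_by(across(all_of(g_vars))) %>%              # group by the columns named in g_vars
  summarise(
    n_students = n_distinct(student),               # distinct students per teacher group
    across(all_of(p_vars), prop_yes),               # proportion of "Yes" per column
    .groups = "drop"
  )

print(out)
```

Because summarise() collapses each group to one row, this covers both steps the question describes (computing the columns and collapsing) in a single pass.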

Adding many computed columns by grouping variables, then collapsing the df
