Question

I have the following dataset with 2 million observations. The data covers April 2008 through April 2010.

> head(df)
               Empst Gender Age Agegroup   Marst                         Education State Year Month
1           Employed Female  58    50-60 Married  Some college or associate degree    AL 2008    12
2 Not in labor force   Male  63      61+ Married   Less than a high school diploma    AL 2008    12
3           Employed   Male  60    50-60  Single  Some college or associate degree    AL 2008    12
4 Not in labor force   Male  55    50-60  Single High school graduates, no college    AL 2008    12
5           Employed   Male  36    30-39  Single  Some college or associate degree    AL 2008    12
6           Employed Female  42    40-49 Married       Bachelor's degree or higher    AL 2008    12
  YYYYMM   Weight
1 200812 1876.356
2 200812 2630.503
3 200812 2763.981
4 200812 2693.110
5 200812 2905.784
6 200812 3511.313

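I want to calculate and plot the monthly unemployment rate. To calculate the unemployment rate, I divide the sum of the weights of the unemployed by the sum of the weights of the employed and the unemployed: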

    sum(df[df$Empst=="Unemployed",]$Weight) / 
    sum(df[df$Empst %in% c("Employed","Unemployed"),]$Weight)
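
To make the arithmetic concrete, here is a minimal sketch of the same weighted ratio on a tiny made-up data frame. The `toy` data and its values are assumptions for illustration only and are not taken from the real dataset.

```r
# Hypothetical toy data: four respondents with survey weights
toy <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force", "Employed"),
  Weight = c(2000, 500, 1500, 2500)
)

# Numerator: total weight of the unemployed
unemployed <- sum(toy$Weight[toy$Empst == "Unemployed"])

# Denominator: total weight of the labor force (employed + unemployed);
# "Not in labor force" rows are excluded
labor_force <- sum(toy$Weight[toy$Empst %in% c("Employed", "Unemployed")])

rate <- unemployed / labor_force
rate  # 500 / 5000 = 0.1
```

Indexing the `Weight` column directly, rather than subsetting the whole data frame first, gives the same result and avoids copying every column.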

To calculate the monthly unemployment rate, I use a for loop:

UnR<-vector()
for(i in levels(factor(df$YYYYMM))){
  temp<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == i,]$Weight) /
        sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == i,]$Weight)
  UnR<-append(UnR,temp)
  rm(temp)
}

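My question is: is there another way to calculate the unemployment rate by month, using apply or something similar? Thanks. Below is a summary of the dataset in case you need it. Let me know if further clarification is needed.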

    Empst            Gender             Age         Agegroup          Marst        
 Not in universe   :  11423   Male  :1266475   Min.   :16.00   16-19:187734   Married:1441114  
 Employed          :1600882   Female:1377638   1st Qu.:31.00   20-29:422699   Married:      0  
 Unemployed        : 132344                    Median :45.00   30-39:431298   Single :1202999  
 Not in labor force: 899464                    Mean   :45.81   40-49:490533   Single :      0  
                                               3rd Qu.:59.00   50-60:518633   Single :      0  
                                               Max.   :85.00   61+  :593216   Single :      0  

                             Education          State              Year          Month       
 Less than a high school diploma  :418636   CA     : 221244   Min.   :2008   Min.   : 1.000  
 High school graduates, no college:802141   TX     : 132650   1st Qu.:2008   1st Qu.: 4.000  
 Some college or associate degree :719492   NY     : 114282   Median :2009   Median : 6.000  
 Bachelor's degree or higher      :703844   FL     : 106116   Mean   :2009   Mean   : 6.385  
                                            PA     :  82482   3rd Qu.:2009   3rd Qu.: 9.000  
                                            IL     :  80816   Max.   :2010   Max.   :12.000  
                                            (Other):1906523                                  
     YYYYMM           Weight     
 Min.   :200804   Min.   :    0  
 1st Qu.:200810   1st Qu.: 1176  
 Median :200904   Median : 2496  
 Mean   :200887   Mean   : 2226  
 3rd Qu.:200910   3rd Qu.: 3139  
 Max.   :201004   Max.   :16822

Answer 1

Have you considered the plyr package, specifically ddply? You pass the data frame in and split it on the unique timestamp, so you get something like:

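library(plyr)
# Split df by YYYYMM and compute the weighted unemployment rate for each month
unemployment_rate.df <- ddply(.data = df,
                              .variables = "YYYYMM",
                              .fun = function(x) {
                                sum(x$Weight[x$Empst == "Unemployed"]) /
                                  sum(x$Weight[x$Empst %in% c("Employed", "Unemployed")])
                              })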

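What this should do is go through the dataset for each unique year-month combination and perform the unemployment calculation, returning a dataset like this: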

YYYYMM V1
200812 0.13
200901 0.1
200902 0.43

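If the goal is to speed up your for loop, another way to get there (and one you should apply to for loops in general) is to specify the length of the output vector up front, if you know it. In this example, you know the output vector will have the same length as unique(df$YYYYMM), so if you specify that in advance the loop should run faster: R no longer has to extend the vector on every iteration and just modifies existing (blank) elements.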

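You can also avoid the assign-then-append step, which takes time too, since the R session has to set aside some space on every iteration. Just assign directly to output_vector[i]. With this example, you'd get something that looks like this: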

# Create the output vector. We can specify its length because we know there will
# be one entry for each unique value in the YYYYMM column.
# That saves time because R just modifies the vector in place.
months <- sort(unique(df$YYYYMM))
UnR <- numeric(length(months))

# And now, the for loop.
for (i in seq_along(months)) {

  # Instead of creating a temporary object (which takes time) and then appending
  # (which takes time), we can assign the result to the i-th element of the
  # output vector. Index by position: a character index such as "200804" would
  # add new named elements instead of filling the preallocated ones.
  in_month <- df$YYYYMM == months[i]
  UnR[i] <- sum(df$Weight[df$Empst == "Unemployed" & in_month]) /
    sum(df$Weight[df$Empst %in% c("Employed", "Unemployed") & in_month])
}

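That should be a lot faster. plyr may be faster still (I haven't benchmarked it), but these for-loop improvements are general, so I thought it was worth taking the time to wax lyrical about them. When people say for loops in R are slow, what they mean is "slow for loops whose output has unknown length" or "slow for loops over non-primitive data types", and they're right. But for an operation like this one, it is entirely possible to write a high-performance loop.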
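
The difference between growing a vector and preallocating it can be checked with a small, self-contained sketch. The function names `grow` and `prealloc`, the iteration count `n`, and the squaring workload are all hypothetical stand-ins; timings will vary by machine and R version, so none are quoted here.

```r
n <- 1e4  # hypothetical number of iterations

# Growing the output: the vector is extended on every iteration
grow <- function(n) {
  out <- vector("numeric")
  for (i in seq_len(n)) out <- append(out, i^2)
  out
}

# Preallocating: the vector has its final length before the loop starts
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Both approaches produce the same values
stopifnot(identical(grow(n), prealloc(n)))

# Compare elapsed times on your own machine
system.time(grow(n))
system.time(prealloc(n))
```

Note that `out[i]` uses an integer position. Assigning with a character index on an unnamed preallocated vector would append named elements rather than fill the existing slots.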

Answer 2

You can do this with dplyr, in a way somewhat similar to the plyr approach.

library(dplyr)
# %>% replaces the old %.% chaining operator, which is no longer part of dplyr
df %>%
    group_by(YYYYMM) %>%
    summarize(UnR = sum(Weight[Empst == "Unemployed"]) /
                    sum(Weight[Empst %in% c("Employed", "Unemployed")]))

dplyr will almost certainly be faster than plyr, but unless your data is very large you probably won't notice the difference.
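
For comparison, the same grouped calculation can be sketched in base R with `tapply`, which is closer to the "apply" family the question asks about. This is not taken from either answer; the `toy` data frame below is made up, and with the real data you would substitute `df`.

```r
# Hypothetical toy data standing in for df
toy <- data.frame(
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  Weight = c(900, 100, 400, 1500, 500, 2000)
)

# Keep only the labor force (employed + unemployed)
lf <- toy[toy$Empst %in% c("Employed", "Unemployed"), ]

# Sum weights per month: unemployed only (other rows contribute 0), then all
unemp_w <- tapply(lf$Weight * (lf$Empst == "Unemployed"), lf$YYYYMM, sum)
total_w <- tapply(lf$Weight, lf$YYYYMM, sum)

# One rate per month, named by YYYYMM
UnR <- unemp_w / total_w
UnR
```

The result is a named vector with one entry per month, which can be passed straight to a plotting function.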

Calculating a percentage for each month in R
