Calculating a percentage for each month in R

Asked: 2014-04-07 17:20:58

Tags: r

I have the following dataset with 2 million observations. The data cover April 2008 through April 2010.

> head(df)
               Empst Gender Age Agegroup   Marst                         Education State Year Month
1           Employed Female  58    50-60 Married  Some college or associate degree    AL 2008    12
2 Not in labor force   Male  63      61+ Married   Less than a high school diploma    AL 2008    12
3           Employed   Male  60    50-60  Single  Some college or associate degree    AL 2008    12
4 Not in labor force   Male  55    50-60  Single High school graduates, no college    AL 2008    12
5           Employed   Male  36    30-39  Single  Some college or associate degree    AL 2008    12
6           Employed Female  42    40-49 Married       Bachelor's degree or higher    AL 2008    12
  YYYYMM   Weight
1 200812 1876.356
2 200812 2630.503
3 200812 2763.981
4 200812 2693.110
5 200812 2905.784
6 200812 3511.313

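I want to calculate and plot the monthly unemployment rate. To calculate the unemployment rate, I divide the weighted sum of the unemployed by the weighted sum of the employed plus the unemployed: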

    sum(df[df$Empst=="Unemployed",]$Weight) / 
    sum(df[df$Empst %in% c("Employed","Unemployed"),]$Weight)

To calculate the unemployment rate for each month, I use a for loop:

UnR<-vector()
for(i in levels(factor(df$YYYYMM))){
  temp<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == i,]$Weight) /
        sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == i,]$Weight)
  UnR<-append(UnR,temp)
  rm(temp)
}

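My question is: is there another way to calculate the unemployment rate by month, using apply or something similar? Thanks. A summary of the dataset is below in case you need it. Let me know if anything needs further clarification.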

    Empst            Gender             Age         Agegroup          Marst        
 Not in universe   :  11423   Male  :1266475   Min.   :16.00   16-19:187734   Married:1441114  
 Employed          :1600882   Female:1377638   1st Qu.:31.00   20-29:422699   Married:      0  
 Unemployed        : 132344                    Median :45.00   30-39:431298   Single :1202999  
 Not in labor force: 899464                    Mean   :45.81   40-49:490533   Single :      0  
                                               3rd Qu.:59.00   50-60:518633   Single :      0  
                                               Max.   :85.00   61+  :593216   Single :      0  

                             Education          State              Year          Month       
 Less than a high school diploma  :418636   CA     : 221244   Min.   :2008   Min.   : 1.000  
 High school graduates, no college:802141   TX     : 132650   1st Qu.:2008   1st Qu.: 4.000  
 Some college or associate degree :719492   NY     : 114282   Median :2009   Median : 6.000  
 Bachelor's degree or higher      :703844   FL     : 106116   Mean   :2009   Mean   : 6.385  
                                            PA     :  82482   3rd Qu.:2009   3rd Qu.: 9.000  
                                            IL     :  80816   Max.   :2010   Max.   :12.000  
                                            (Other):1906523                                  
     YYYYMM           Weight     
 Min.   :200804   Min.   :    0  
 1st Qu.:200810   1st Qu.: 1176  
 Median :200904   Median : 2496  
 Mean   :200887   Mean   : 2226  
 3rd Qu.:200910   3rd Qu.: 3139  
 Max.   :201004   Max.   :16822  

2 answers:

Answer 0 (score: 2)

Have you considered the plyr package, specifically ddply? You pass your data frame in and split it on the unique timestamps. So you get something like:

library(plyr)
unemployment_rate.df <- ddply(.data = df,
                              .variables = "YYYYMM",
                              .fun = function(x){
                                # Weighted unemployed divided by the weighted labor force
                                sum(x$Weight[x$Empst == "Unemployed"]) /
                                  sum(x$Weight[x$Empst %in% c("Employed", "Unemployed")])
                              })

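What this should do is go through the dataset for each unique year-month combination and perform the unemployment calculation, returning a dataset like this: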

YYYYMM V1
200812 0.13
200901 0.1
200902 0.43
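
For readers without plyr, the same split-by-month calculation can be sketched in base R with tapply, which is the kind of apply-style function the question asks about. This is only a minimal sketch: the toy data frame below is made up so the code runs on its own, and the names toy, labor, unemp_w, force_w and rate are hypothetical.

```r
# Toy data (hypothetical values, only to make the sketch runnable)
toy <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  YYYYMM = c(200812, 200812, 200812, 200901, 200901, 200901),
  Weight = c(300, 100, 50, 450, 50, 500)
)

# Keep only the labor force (employed + unemployed)
labor <- toy[toy$Empst %in% c("Employed", "Unemployed"), ]

# Weighted unemployed and weighted labor force, summed per month.
# Multiplying by the logical zeroes out the weights of employed rows.
unemp_w <- tapply(labor$Weight * (labor$Empst == "Unemployed"), labor$YYYYMM, sum)
force_w <- tapply(labor$Weight, labor$YYYYMM, sum)

# One rate per month, named by YYYYMM
rate <- unemp_w / force_w
rate
```

On this toy data the rate is 0.25 for 200812 (100 / 400) and 0.05 for 200901 (50 / 1000). Replacing toy with the real df should give one value per month without an explicit loop.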

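If the goal is to speed up your for loop, another way to get there (and one you should apply to for loops generally) is to pre-specify the length of the output vector, if you know it. In this example, you know the output vector will have the same length as unique(df$YYYYMM). If you specify that in advance, the loop should run faster, because R no longer has to grow the vector on every iteration; it just modifies existing (blank) elements.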

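You can also avoid the assign-then-append pattern, which takes time because the R session has to set aside some memory on every iteration, and simply assign to output_vector[i]. So for this example you'd end up with something that looks like this: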

#Create an output vector. We can specify length, because we know there'll
#be one entry for each unique value in the YYYYMM column.
#That saves time because R fills in existing elements instead of growing the vector.
months <- sort(unique(df$YYYYMM))
UnR <- numeric(length(months))

#And now, the for loop. We loop over positions (1, 2, ...) rather than over the
#month labels, so that UnR[i] refers to an existing element of the output vector.
for(i in seq_along(months)){

  #Instead of creating a temporary object (which takes time), and then appending
  #(which takes time), we can just assign the result to the ith element of the
  #output vector.
  UnR[i]<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == months[i],]$Weight) /
        sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == months[i],]$Weight)
}

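That should be a lot faster. plyr may be faster still (I haven't benchmarked it), but these for-loop improvements are general, so I thought it was worth taking the time to wax lyrical about them. When people say for loops in R are slow, what they mean is "for loops with output of unknown length are slow" or "for loops over non-primitive data types are slow", and they're right. But for an operation like this, it is entirely possible to write a performant loop.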
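
To make the preallocation point concrete, here is a small self-contained sketch that compares growing a vector with append() against filling a preallocated one. The function names grow and prealloc and the iteration count n are hypothetical, chosen only for illustration; actual timings depend on the machine and R version, so none are quoted.

```r
n <- 1e4  # number of iterations (arbitrary, for illustration)

# Growing the result with append(): a new, longer vector is built each time
grow <- function(n) {
  out <- vector("numeric")
  for (i in seq_len(n)) out <- append(out, i^2)
  out
}

# Preallocating: the vector already has its final length
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Both approaches produce the same values
stopifnot(identical(grow(n), prealloc(n)))

# Compare elapsed times on your own machine
system.time(grow(n))
system.time(prealloc(n))
```

The gap between the two widens as n grows. With only about 25 months in the question's data, the loop itself is short, so most of the running time there is likely spent subsetting the 2 million rows rather than growing the vector.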

Answer 1 (score: 1)

You can do this with dplyr, in a way that is somewhat similar to the plyr approach.

library(dplyr)
# %>% replaces the older %.% chaining operator, which current dplyr no longer provides
df %>%
    group_by(YYYYMM) %>%
    summarize(UnR = sum(Weight[Empst == "Unemployed"]) /
                    sum(Weight[Empst %in% c("Employed", "Unemployed")]))

dplyr will almost certainly be faster than plyr, but unless your data are very large you probably won't notice the difference.