I have the following dataset with about 2 million observations. The data cover April 2008 through April 2010.
> head(df)
Empst Gender Age Agegroup Marst Education State Year Month
1 Employed Female 58 50-60 Married Some college or associate degree AL 2008 12
2 Not in labor force Male 63 61+ Married Less than a high school diploma AL 2008 12
3 Employed Male 60 50-60 Single Some college or associate degree AL 2008 12
4 Not in labor force Male 55 50-60 Single High school graduates, no college AL 2008 12
5 Employed Male 36 30-39 Single Some college or associate degree AL 2008 12
6 Employed Female 42 40-49 Married Bachelor's degree or higher AL 2008 12
YYYYMM Weight
1 200812 1876.356
2 200812 2630.503
3 200812 2763.981
4 200812 2693.110
5 200812 2905.784
6 200812 3511.313
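I want to compute and plot the monthly unemployment rate. To compute the unemployment rate, I divide the weighted sum of the unemployed by the weighted sum of the employed plus the unemployed: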
sum(df[df$Empst=="Unemployed",]$Weight) /
sum(df[df$Empst %in% c("Employed","Unemployed"),]$Weight)
To compute the unemployment rate for each month, I use a for loop:
UnR<-vector()
for(i in levels(factor(df$YYYYMM))){
temp<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == i,]$Weight) /
sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == i,]$Weight)
UnR<-append(UnR,temp)
rm(temp)
}
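My question is: is there another way to compute the unemployment rate by month, using apply or something similar? Thanks. Below is a summary of the dataset in case you need it. Let me know if anything needs further clarification.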
Empst Gender Age Agegroup Marst
Not in universe : 11423 Male :1266475 Min. :16.00 16-19:187734 Married:1441114
Employed :1600882 Female:1377638 1st Qu.:31.00 20-29:422699 Married: 0
Unemployed : 132344 Median :45.00 30-39:431298 Single :1202999
Not in labor force: 899464 Mean :45.81 40-49:490533 Single : 0
3rd Qu.:59.00 50-60:518633 Single : 0
Max. :85.00 61+ :593216 Single : 0
Education State Year Month
Less than a high school diploma :418636 CA : 221244 Min. :2008 Min. : 1.000
High school graduates, no college:802141 TX : 132650 1st Qu.:2008 1st Qu.: 4.000
Some college or associate degree :719492 NY : 114282 Median :2009 Median : 6.000
Bachelor's degree or higher :703844 FL : 106116 Mean :2009 Mean : 6.385
PA : 82482 3rd Qu.:2009 3rd Qu.: 9.000
IL : 80816 Max. :2010 Max. :12.000
(Other):1906523
YYYYMM Weight
Min. :200804 Min. : 0
1st Qu.:200810 1st Qu.: 1176
Median :200904 Median : 2496
Mean :200887 Mean : 2226
3rd Qu.:200910 3rd Qu.: 3139
Max. :201004 Max. :16822
Answer 0 (score: 2)
Have you considered using the plyr package, specifically ddply? You pass the data frame in and split it on the unique timestamps. So you get something like:
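library(plyr)
unemployment_rate.df <- ddply(.data = df,
                              .variables = "YYYYMM",
                              .fun = function(x){
  # Weighted unemployed divided by the weighted labor force (employed + unemployed)
  sum(x$Weight[x$Empst == "Unemployed"]) /
    sum(x$Weight[x$Empst %in% c("Employed", "Unemployed")])
})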
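What this should do is go through the dataset for each unique year-month combination and perform the unemployment calculation, returning a dataset like this: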
YYYYMM V1
200812 0.13
200901 0.1
200902 0.43
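Since the question asked about apply-style functions, the same split-by-month calculation can also be sketched in base R with tapply. This is only an illustration: the `toy` data frame and its values are made up to stand in for the real `df`, and the column names are assumed to match those shown in the question.

```r
# Toy data standing in for the real df (values are made up for illustration).
toy <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  YYYYMM = c(200812, 200812, 200812, 200901, 200901, 200901),
  Weight = c(900, 100, 500, 600, 200, 200)
)

# Keep only the labor force (employed + unemployed).
labor_force <- toy[toy$Empst %in% c("Employed", "Unemployed"), ]

# Weighted totals per month: unemployed only, then the whole labor force.
# Multiplying by the logical zeroes out the weights of employed rows.
unemployed_w  <- tapply(labor_force$Weight * (labor_force$Empst == "Unemployed"),
                        labor_force$YYYYMM, sum)
labor_force_w <- tapply(labor_force$Weight, labor_force$YYYYMM, sum)

# One rate per month, named by YYYYMM.
UnR <- unemployed_w / labor_force_w
print(UnR)
```

The result is a named vector with one entry per month, so it can be indexed by the YYYYMM value as a string if needed.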
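If the goal is to speed up your for loop, another way to get there (and one you should apply to for loops in general) is to specify the length of the output vector up front, if you know it. In this example, you know the output vector will have the same length as unique(df$YYYYMM), so if you specify that in advance the loop should run faster, because R no longer has to grow the vector on every iteration; it just modifies existing (blank) elements.
You can also avoid the temporary assignment and the append this way, both of which take time, since the R session has to set aside some space on every iteration: just assign to output_vector[i]. So, with this example, you'd get something that looks like this: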
#Create an output vector. We can specify length, because we know there'll
#be one entry for each unique value in the YYYYMM column.
#That saves time because it means R just modifies the vector in place.
months <- sort(unique(df$YYYYMM))
UnR <- numeric(length(months))
#And now, the for loop. We loop over integer positions so that UnR[i]
#fills the preallocated slots instead of adding new named elements.
for(i in seq_along(months)){
#Instead of creating a temporary object (which takes time), and then appending
#(which takes time), we can just assign the result to the Ith element of the
#output vector.
UnR[i]<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == months[i],]$Weight) /
sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == months[i],]$Weight)
}
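That should be much faster. plyr may be faster still (I haven't benchmarked it), but these for-loop improvements are general, so I thought it worth taking the time to wax lyrical about them. When people say for loops in R are slow, what they mean is "slow for loops with outputs of unknown length" or "slow for loops over non-primitive data types", and they're right. But for an operation like this, it's entirely possible to write a performant loop.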
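To make the preallocation point concrete, here is a minimal sketch that times a growing vector against a preallocated one. The functions `grow` and `prealloc` and the size `n` are hypothetical names chosen for illustration; timings will vary by machine, so no expected numbers are shown.

```r
n <- 1e4

# Growing the vector: append() builds a new, longer vector on each iteration.
grow <- function(n) {
  out <- vector("numeric")
  for (i in seq_len(n)) out <- append(out, i^2)
  out
}

# Preallocating: the vector has its final length up front and is filled by index.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Both approaches give the same result; only the timing differs.
stopifnot(identical(grow(n), prealloc(n)))

print(system.time(grow(n)))
print(system.time(prealloc(n)))
```

Note that the monthly loop in this thread only runs about 25 iterations (April 2008 through April 2010), so there the saving from preallocation alone is likely to be small; the habit matters more for loops with many iterations.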
Answer 1 (score: 1)
You can do this with dplyr, in a way that is somewhat similar to the plyr approach.
library(dplyr)
# %>% replaces the older %.% operator, which has been removed from dplyr
df %>%
  group_by(YYYYMM) %>%
  summarize(UnR = sum(Weight[Empst == "Unemployed"]) /
                  sum(Weight[Empst %in% c("Employed", "Unemployed")]))
dplyr will almost certainly be faster than plyr, but unless your data is very large you probably won't notice the difference.
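The question also asked about plotting the monthly rate. A minimal base R sketch of that step follows; the `monthly` data frame and its values are made up to stand in for whichever per-month result you end up with, and the column names `YYYYMM` and `UnR` are assumed to match the summaries above.

```r
# Toy monthly rates standing in for the summarised result (made-up values).
monthly <- data.frame(
  YYYYMM = c(200811, 200812, 200901, 200902),
  UnR    = c(0.065, 0.071, 0.085, 0.089)
)

# Build a Date for the first day of each month so the x axis is a real time axis
# (plotting raw YYYYMM numbers would leave a gap between 200812 and 200901).
monthly$Date <- as.Date(paste0(monthly$YYYYMM, "01"), format = "%Y%m%d")

# Line plot of the rate in percent.
plot(monthly$Date, monthly$UnR * 100, type = "l",
     xlab = "Month", ylab = "Unemployment rate (%)")
```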