Question

Every time I try to run the line "DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)", I get NA as the answer. Computing mean(ahe2008) works, but computing mean(ahebachelors2008) does not.

setwd("~/Google Drive/R Data")
data <- read.csv('cps92_08.csv')
year <- data$year
year1992 <- subset(data,year<2000)
year2008 <- subset(data,year>2000)
ahe1992 <- (year1992$ahe)
ahe2008 <- (year2008$ahe)
max(ahe1992)
min(ahe1992)
mean(ahe1992)
median(ahe1992)
sd(ahe1992)
max(ahe2008)
min(ahe2008)
mean(ahe2008)
median(ahe2008)
sd(ahe2008)

adjahe <- ahe1992*(215.2/140.3)
max(adjahe)
min(adjahe)
mean(adjahe)
median(adjahe)
sd(adjahe)

D <- mean(ahe2008) - mean(adjahe)

education <- data$bachelor
ahebachelors1992 <- subset(adjahe, education>0)
ahehighschool1992 <- subset(adjahe,education<1)
ahebachelors2008 <- subset(ahe2008,education>0)
ahehighschool2008 <- subset(ahe2008,education<1)

DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)

Answer 1

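education has the same length as data, whereas ahe2008 is only a subset of data. So when you pass education as the condition for subsetting ahe2008, it creates NAs: the condition is longer than ahe2008, and every TRUE beyond the end of ahe2008 points at an element that does not exist.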

Here is a simpler example:

d1<-c(1:5)
d2<-c(1:5,1:5)
subset(d1,d2==1)
[1]  1 NA

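Possible solutions are to create a separate bachelor vector for each year, or to stop subsetting in successive steps and instead apply several conditions at once, only where you need them.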
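A minimal sketch of the first option, using a small made-up data frame in place of cps92_08.csv. The values are invented, and bachelor1992 and bachelor2008 are names introduced here for illustration; only the columns year, ahe, and bachelor come from the question.

```r
# Toy stand-in for cps92_08.csv; the values are invented for illustration.
data <- data.frame(
  year     = c(1992, 1992, 1992, 2008, 2008, 2008),
  ahe      = c(10, 12, 20, 15, 18, 30),
  bachelor = c(0, 0, 1, 0, 1, 1)
)

year1992 <- subset(data, year < 2000)
year2008 <- subset(data, year > 2000)

# Take the education flag from the same subset as the wages,
# so the condition and the vector it indexes have equal length.
bachelor1992 <- year1992$bachelor
bachelor2008 <- year2008$bachelor

adjahe  <- year1992$ahe * (215.2 / 140.3)  # scaling factor from the question
ahe2008 <- year2008$ahe

ahebachelors1992 <- adjahe[bachelor1992 > 0]
ahebachelors2008 <- ahe2008[bachelor2008 > 0]

DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)
```

Because each condition is built from the same rows as the vector it filters, no index can run past the end of the vector and no NA is introduced.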

If you want to avoid typing the full data$something every time, consider using with() or, better still, the dplyr package.

For example, all of the code leading up to the last line could be replaced with this (assuming I haven't missed anything):

# bachelor is the column of data that the question stores in `education`
DHS <- mean(with(data, ahe[year > 2000 & bachelor > 0])) -
       mean(with(data, ahe[year < 2000 & bachelor > 0] * (215.2/140.3)))

(If you are new to R, note that the [] construct is a simpler way of subsetting.)

You may also want to consider summary(), which gives you the minimum, median, mean, and maximum (plus the quartiles), leaving only sd to add by hand:

summary(with(data,ahe[year>2000]))
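As a rough illustration of how the two calls fit together, here is a sketch on a short invented vector standing in for ahe[year > 2000]. The names stats and spread are made up for this example.

```r
# Invented wage values standing in for ahe[year > 2000].
ahe2008 <- c(15, 18, 30, 22, 25)

stats  <- summary(ahe2008)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
spread <- sd(ahe2008)       # summary() does not report the standard deviation

print(stats)
print(spread)
```

Individual entries can be pulled out by name, for example stats[["Mean"]].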

Answer 2

If the values you are taking the mean of contain NA, the output will be NA. You can get around this by passing na.rm = TRUE to mean():

DHS <- mean(ahebachelors2008, na.rm=TRUE) - mean(ahebachelors1992, na.rm=TRUE)
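A small sketch of that behaviour on an invented vector (x, plain_mean, clean_mean, and n_missing are illustrative names). Counting the NAs first is a cheap way to see how many values na.rm = TRUE will drop, which may be worth checking when the NAs were not expected in the first place.

```r
x <- c(1, 2, NA, 4)  # invented values with one missing entry

plain_mean <- mean(x)                # NA, because one element is missing
clean_mean <- mean(x, na.rm = TRUE)  # mean of the remaining values 1, 2, 4
n_missing  <- sum(is.na(x))          # number of values that na.rm will drop
```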

Why do I get NA when computing the mean?
