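I have a very large data set, so for convenience I am creating a fake one that mirrors it. I have 4 states, 5 years, and 2 types per state, each with values. I want to get the sum of the values for each state, year and type.
If I run a loop using for and which, I don't get the values I need. I wonder if anyone knows a solution.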
StateName<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","c","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d","d")
Year<- rep(1966:1970, times=16)
Type<-c("Y", "Y", "Y", "Y","Y","Y", "Y", "Y", "Y","Y", "Z", "Z", "Z","Z","Z","Z", "Z", "Z","Z","Z","Y", "Y", "Y", "Y","Y","Y", "Y", "Y", "Y","Y", "Z", "Z", "Z","Z","Z","Z", "Z", "Z","Z","Z","Y", "Y", "Y", "Y","Y","Y", "Y", "Y", "Y","Y", "Z", "Z", "Z","Z","Z","Z", "Z", "Z","Z","Z","Y", "Y", "Y", "Y","Y","Y", "Y", "Y", "Y","Y", "Z", "Z", "Z","Z","Z","Z", "Z", "Z","Z","Z")
Value<-rep(1:4, times=20)
Test_Data<-cbind(StateName, Year, Type, Value)
Test_Data<-data.frame(Test_Data)
New_Table<-cbind(unique(StateName), 1966:1967, NA, NA)
New_Table<-data.frame(New_Table)
colnames(New_Table)<-c("State", "Year", "AA_Sum", "BB_Sum")
for(A in 1:nrow(Test_Data)){
temp_index = which(as.character(Test_Data$StateName[A]) %in% as.character(New_Table$State) &
Test_Data$Year[A] %in% New_Table$Year &
Test_Data$Value[A] == "AA" )
New_Table$AA_Sum<- sum(Test_Data$Value[temp_index])
}
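Currently I get the error "Error in Summary.factor(integer(0), na.rm = TRUE) : 'sum' not meaningful for factors".
I would like to know how to fill New_Table with the sum of Y for each state and year and, similarly, the sum of Z for each state and year.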
Answer 0 (score: 1)
As Richard correctly pointed out, you can use plyr or dplyr to solve this:
library(dplyr)
Test_Data %>% group_by(StateName, Year) %>% summarise(AA_Sum=sum(Value))
The error you get is because Test_Data$Value is a factor. Why? The way you build the data.frame:
Test_Data<-cbind(StateName, Year, Type, Value)
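binds the four vectors into a matrix. All rows and columns of a matrix share a single data type. Since one of the vectors you bind is character, the result is a character matrix. Observe: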
> str(cbind(StateName, Year, Type, Value))
chr [1:4, 1:4] "a" "b" "c" "d" "1966" "1967" "1966" "1967" NA NA NA NA NA NA NA NA
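When this matrix is converted to a data.frame, the default behaviour (in R versions before 4.0.0) is to turn character vectors into factors. Annoying. You can avoid it with the argument stringsAsFactors=FALSE, although the values then stay character rather than numeric. (Also, check out the function str, which is really helpful for inspecting objects.)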
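The coercion described above can be checked on a small example. This is only a sketch: the short vectors below are stand-ins for the question's data, and the names m, df_factor, recovered and df_direct are made up for illustration.

```r
# Small stand-in vectors (hypothetical, shorter than the question's data).
StateName <- c("a", "a", "b", "b")
Value <- 1:4

# cbind() on a character and an integer vector yields a character matrix.
m <- cbind(StateName, Value)
typeof(m)

# With stringsAsFactors = TRUE (the default before R 4.0.0) every column becomes a factor.
df_factor <- data.frame(m, stringsAsFactors = TRUE)
class(df_factor$Value)

# A factor holding digits can be recovered by going through character first.
recovered <- as.numeric(as.character(df_factor$Value))
sum(recovered)

# Passing the vectors directly to data.frame() keeps Value as an integer column.
df_direct <- data.frame(StateName, Value)
class(df_direct$Value)
```

Converting with as.numeric() alone would return the factor's internal level codes rather than the original numbers, which is why the sketch goes through as.character() first.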
You can build the data.frame you intended in a single line:
Test_Data <- data.frame(StateName=StateName, Year=rep(1966:1970, times=16), Type=Type, Value=rep(1:4, times=20))
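Once Value is numeric, the sums per state, year and type can also be computed without extra packages. The following is a sketch using base R's aggregate and reshape, not the approach from the answer. It rebuilds the question's data with rep() for brevity, and the names sums and wide are made up for illustration.

```r
# Rebuild the question's fake data; the rep() calls reproduce the same vectors.
Test_Data <- data.frame(
  StateName = rep(c("a", "b", "c", "d"), each = 20),
  Year      = rep(1966:1970, times = 16),
  Type      = rep(rep(c("Y", "Z"), each = 10), times = 4),
  Value     = rep(1:4, times = 20),
  stringsAsFactors = FALSE
)

# One row per state/year/type combination with the summed Value.
sums <- aggregate(Value ~ StateName + Year + Type, data = Test_Data, FUN = sum)

# Optional: one row per state/year, with separate columns Value.Y and Value.Z.
wide <- reshape(sums, idvar = c("StateName", "Year"),
                timevar = "Type", direction = "wide")
head(wide)
```

The long form in sums is usually easier to work with further. The wide form is closer to the layout of New_Table in the question.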
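Finally, your for loop does not do what you expect. a) temp_index tests a single row, so at most it is the integer 1, and most of the time it is a zero-length vector, which is where the integer(0) part of the error comes from. b) You are iterating over all rows of Test_Data, but trying to summarise into the entries found in New_Table. The last line of the loop, New_Table$AA_Sum<- ..., simply overwrites the entire column with the current sum.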
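Both points can be reproduced in isolation. The snippet below is a minimal sketch with made-up names (idx_hit, idx_miss, f, msg, tbl), not the question's actual objects.

```r
# which() applied to a single logical value returns either 1 or an empty index.
idx_hit  <- which(TRUE)
idx_miss <- which(FALSE)
length(idx_miss)

# Indexing a factor with an empty index gives an empty factor, and sum() refuses factors.
f <- factor(c("1", "2", "3"))
msg <- tryCatch(sum(f[idx_miss]), error = function(e) conditionMessage(e))
msg

# Assigning a single value to a whole column recycles it over every row.
tbl <- data.frame(State = c("a", "b"), AA_Sum = NA)
tbl$AA_Sum <- 5
tbl
```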
What you probably want to do (if you ignore the other answers) is:
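for (i in 1:nrow(New_Table)) {
  # Rows of Test_Data that match row i of New_Table (its state column is named State)
  tempindex <- which(Test_Data$StateName == New_Table$State[i] & ...)
  # Write the sum into row i only, not into the whole column
  New_Table$AA_Sum[i] <- sum(Test_Data$Value[tempindex])
}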
I have left part of the code out as an exercise. Check the value of tempindex at each i, and extend the expression as needed.
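For readers who want to compare their result with a worked version, here is one possible completion of the loop. It is a sketch under assumptions: New_Table is rebuilt with expand.grid so that it holds one row per state and year (the question's version has only 4 rows), AA_Sum is taken to mean the Y sum and BB_Sum the Z sum, and same_cell is a made-up helper name.

```r
# Rebuild the question's fake data with numeric Value.
Test_Data <- data.frame(
  StateName = rep(c("a", "b", "c", "d"), each = 20),
  Year      = rep(1966:1970, times = 16),
  Type      = rep(rep(c("Y", "Z"), each = 10), times = 4),
  Value     = rep(1:4, times = 20),
  stringsAsFactors = FALSE
)

# Assumption: one row per state/year combination, unlike the 4-row table in the question.
New_Table <- expand.grid(State = unique(Test_Data$StateName),
                         Year = 1966:1970,
                         stringsAsFactors = FALSE)
New_Table$AA_Sum <- NA  # assumed to hold the sum of type Y
New_Table$BB_Sum <- NA  # assumed to hold the sum of type Z

for (i in seq_len(nrow(New_Table))) {
  # Logical vector marking the Test_Data rows for this state and year.
  same_cell <- Test_Data$StateName == New_Table$State[i] &
    Test_Data$Year == New_Table$Year[i]
  # Fill row i only, one column per type.
  New_Table$AA_Sum[i] <- sum(Test_Data$Value[same_cell & Test_Data$Type == "Y"])
  New_Table$BB_Sum[i] <- sum(Test_Data$Value[same_cell & Test_Data$Type == "Z"])
}
head(New_Table)
```

On a very large data set a row-by-row loop like this will be slow compared with a grouped summary, so it is mainly useful for understanding what the grouped version computes.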