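I have a large dataset that is read from SPSS files. It contains many rows and columns and is assembled from many small SPSS files. The SPSS files contain some errors that I want to correct in R. When the data is read in, the factor levels come out full of noise, even though the data looks fine in SPSS. I cannot change the factor levels in SPSS for so many individual files. Here is a small sample of my data: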
data
    a b  c                   d                 e
[1] 3 5  1 Very dissatisfied 5                 5
[2] 8 3  10                  Don't Know        1
[3] 7 5  3                   8                 6
[4] 3 5  9                   6                 99
[5] 9 4  8                   10 Very Satisfied 3
[6] 5 NA 99 Don't Know       Very Satisfied    10
levels(data[,1])
[1] "1 Very Dissatisfied" "2" "3" "4"
[5] "5" "6" "7" "8"
[9] "9" "1" "10 Very Satisfied" "99 Don't know"
[13] "1 Very Bad" "99" "2 Satisfied" "10"
These levels contain many errors. I want to correct them to the following:
x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied",
"5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied",
"99 Dont Know"))
levels(x)
[1] "1 Very Dissatisfied" "2 Satisfied" "3 Satisfied" "4 Satisfied"
[5] "5 Satisfied" "6 Satisfied" "7 Satisfied" "8 Satisfied"
[9] "9 Satisfied" "10 Very Satisfied" "99 Dont Know"
I tried the following code:
for(j in c(1,2,5)){
data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
for(i in 2:9){
data[grep(i,data[,j]),j] <- paste(i,"Satisfied")}
}
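This does not work. Please tell me where I went wrong and what I should do instead.
Even once this code works, I will still have to remove the unused junk levels that the variables contain. How do I do that?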
Answer 0: (score: 2)
Clean your data. This strips everything except the digits and leaves NA values as NA.
data=apply(data,1:2,function(x) gsub("[^0-9]", "",x))
The data will then look like this:
a b c d e
[1,] "3" "5" "1" "5" "5"
[2,] "8" "3" "10" "99" "1"
[3,] "7" "5" "3" "8" "6"
[4,] "3" "5" "9" "6" "99"
[5,] "9" "4" "8" "10" "3"
[6,] "5" NA "99" "10" "10"
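As a minimal sketch of this cleaning step, here is the same gsub call applied to a small made-up vector (the values are assumptions for illustration, not the asker's data). It also shows one thing to watch for: an entry that has a label but no number, such as "Don't Know", becomes an empty string rather than NA, so it may need separate handling.

```r
# Hypothetical sample values, invented for illustration
raw <- c("3", "1 Very dissatisfied", "99 Don't Know", "Don't Know", NA)

# Remove every character that is not a digit; NA stays NA
cleaned <- gsub("[^0-9]", "", raw)
print(cleaned)

# Label-only entries are now "", not NA; convert them explicitly if that is wanted
cleaned[!is.na(cleaned) & cleaned == ""] <- NA
print(cleaned)
```

Whether a label-only entry should become NA or be mapped to a code (for example 99) depends on the data, so that choice is left open here.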
Then recode your strings.
# Install the car package
install.packages("car")
# Load the car package
library("car")
replace_string=function(x) {
recode(x,'1="1 Very Dissatisfied";
2="2 Satisfied";
3="3 Satisfied";
4="4 Satisfied";
5="5 Satisfied";
6="6 Satisfied";
7="7 Satisfied";
8="8 Satisfied";
9="9 Satisfied";
10="10 Very Satisfied";
99="99 Dont Know"')
}
data=apply(data,1:2,replace_string)
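If installing car is not an option, a similar relabelling can probably be done in base R by passing both levels and labels to factor(). The sketch below is an alternative to the recode call above, not the answerer's method; the input vector is hypothetical and the names codes, target_levels and relabelled are made up for illustration.

```r
# Numeric codes expected after cleaning, and the labels wanted in the question
codes <- c(1:10, 99)
target_levels <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
                   "10 Very Satisfied", "99 Dont Know")

# Hypothetical cleaned column (character digits and NA)
cleaned <- c("3", "10", "99", NA, "1")

# Map each code to its label and build an ordered factor in one step
relabelled <- factor(cleaned,
                     levels = as.character(codes),
                     labels = target_levels,
                     ordered = TRUE)
print(relabelled)
```

Because the levels are given explicitly, the result carries exactly the eleven target levels, so no junk levels are left to drop afterwards.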
Answer 1: (score: 1)
I would suggest not applying the SPSS value labels on import, so that the underlying SPSS codes are preserved:
library(foreign)  # provides read.spss()
temp <- read.spss(file, use.value.labels = FALSE)
Then I would use ifelse to correct the labels, along the lines of your for loop:
# 2:9 as in the question's loop; codes 1, 10 and 99 need their own labels
temp$c <- ifelse(as.numeric(temp$c) %in% 2:9, paste(temp$c, "Satisfied", sep=" "), temp$c)
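To make the idea concrete, here is a small self-contained sketch of the ifelse approach on a made-up vector of numeric codes (the vector and the names codes and labelled are assumptions for illustration). It labels 2 to 9 with ifelse and then handles the three special codes separately.

```r
# Hypothetical numeric codes, as they might look with use.value.labels = FALSE
codes <- c(1, 5, 10, 99, NA)

# Codes 2..9 get the generic "Satisfied" label; everything else is kept as text for now
labelled <- ifelse(codes %in% 2:9, paste(codes, "Satisfied"), as.character(codes))

# The remaining special codes get their own labels
labelled[codes %in% 1]  <- "1 Very Dissatisfied"
labelled[codes %in% 10] <- "10 Very Satisfied"
labelled[codes %in% 99] <- "99 Dont Know"
print(labelled)
```

Using %in% rather than == keeps NA entries out of the index, so missing values pass through unchanged.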
Answer 2: (score: 0)
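Where I went wrong was the grep call. I had used grep(i, data) where I should have used the anchored pattern ^i$, built with paste("^", i, "$", sep=""). The unanchored version matched 10 as well as 1, and 99 as well as 9. Anchoring the pattern matches the value exactly, so ^9$ matches only 9 and not 99.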
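The difference is easy to see on a small made-up vector (the values and variable names below are illustrative assumptions, not the real data):

```r
# Hypothetical values of the kind found in the noisy columns
values <- c("1", "10", "9", "99", "1 Very Dissatisfied")

# Unanchored: matches any string that contains the digit
loose_9 <- grep("9", values)
loose_1 <- grep("1", values)

# Anchored: matches only strings that are exactly the digit
exact_9 <- grep(paste0("^", 9, "$"), values)
exact_1 <- grep(paste0("^", 1, "$"), values)

print(values[loose_9])
print(values[exact_9])
```

Here loose_9 picks up both "9" and "99", while exact_9 picks up only "9".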
To drop the unused levels from the factor and treat it as an ordinal variable, I ended up calling ordered() on the data, which solved the problem.
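A minimal sketch of the two usual ways to get rid of unused levels, on a made-up factor (the junk levels and the names f, dropped and ord are assumptions for illustration). droplevels() keeps only the levels that actually occur, while ordered() with an explicit levels argument keeps exactly the target levels in the given order.

```r
# Target levels as defined in this answer
target_levels <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
                   "10 Very Satisfied", "Dont Know")

# Hypothetical corrected column that still carries junk levels
f <- factor(c("2 Satisfied", "10 Very Satisfied", "Dont Know"),
            levels = c("1 Very Bad", "99", target_levels))

# Option 1: keep only the levels that are in use
dropped <- droplevels(f)
print(levels(dropped))

# Option 2: rebuild as an ordered factor with exactly the target levels
ord <- ordered(f, levels = target_levels)
print(levels(ord))
```

With the second option, any value that is not one of the target levels becomes NA, so it is worth correcting all values before that step.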
This is the complete code I used to fix it myself:
Step 1: Define the levels of the factor.
x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied","5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied","Dont Know"))
Step 2: Loop over the affected columns and correct the values.
I used the following code:
for (j in c(28, 29, 32)) {
  data[, j] <- factor(data[, j])
  # Add the required levels first so that assigning them later does not introduce NA;
  # unique() avoids an error when a target level is already present in the column
  data[, j] <- factor(data[, j], levels = unique(c(levels(data[, j]), levels(x))))
  # Now remove and correct the noise
  data[grep("99", data[, j]), j] <- "Dont Know"
  data[grep("Don", data[, j]), j] <- "Dont Know"
  data[grep("Very [Ss]", data[, j]), j] <- "10 Very Satisfied"
  data[grep("10", data[, j]), j] <- "10 Very Satisfied"
  data[grep("Very [Dd]", data[, j]), j] <- "1 Very Dissatisfied"
  data[grep("^1$", data[, j]), j] <- "1 Very Dissatisfied"
  # Loop through the remaining values and correct them
  for (i in 2:9) {
    data[grep(paste("^", i, "$", sep = ""), data[, j]), j] <- paste(i, "Satisfied")
  }
  # Rebuild as an ordered factor with only the target levels, which drops the unused ones
  data[, j] <- ordered(data[, j], levels(x))
}