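R: changing a variable's factor levels and removing the old ones

Date: 2014-11-09 03:17:38

Tags: r spss levels

I have a large data set read from SPSS files. It has many rows and columns and is assembled from many small SPSS files. The SPSS files contain some errors that I want to correct in R. When the data is read in, the factor levels come out full of noise, even though the data looks fine in SPSS. I cannot change the factor levels in SPSS across that many individual files. Here is a small sample of my data: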

data
    a  b                   c                  d    e
[1] 3  5 1 Very dissatisfied                  5    5
[2] 8  3                  10         Don't Know    1
[3] 7  5                   3                  8    6
[4] 3  5                   9                  6   99
[5] 9  4                   8  10 Very Satisfied    3
[6] 5 NA       99 Don't Know     Very Satisfied   10

levels(data[,1])
 [1] "1 Very Dissatisfied" "2"                 "3"             "4"                
 [5] "5"                   "6"                 "7"             "8"                
 [9] "9"                   "1" "10 Very Satisfied" "99 Don't know"
[12] "1 Very Bad"        "99"       "2 Satisfied"             "10"

These levels contain many errors. I want to correct them to the following:

x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied",
"5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied",
"99 Dont Know"))

levels(x)
[1] "1 Very Dissatisfied"  "2 Satisfied"         "3 Satisfied"    "4 Satisfied"      
[5] "5 Satisfied"          "6 Satisfied"         "7 Satisfied"    "8 Satisfied"      
[9] "9 Satisfied"          "10 Very Satisfied"  "99 Dont Know"

I tried the following code:

for(j in c(1,2,5)){
    data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
    for(i in 2:9){
        data[grep(i,data[,j]),j] <- paste(i,"Satisfied")}
}

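This does not work. Please tell me where I am going wrong and what I should do instead.

Even once this code works, I will still need to remove the unused junk levels that the variables contain. How do I do that?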

3 answers:

Answer 0: (score: 2)

  1. Clean your data. This leaves only digits and NA.

    data=apply(data,1:2,function(x) gsub("[^0-9]", "",x))
    

    The data will then look like this:

          a   b   c    d    e   
    
    [1,] "3" "5" "1"  "5"  "5"     
    [2,] "8" "3" "10" "99" "1"   
    [3,] "7" "5" "3"  "8"  "6"   
    [4,] "3" "5" "9"  "6"  "99"  
    [5,] "9" "4" "8"  "10" "3"   
    [6,] "5" NA  "99" "10" "10"  
    
  2. Recode your strings.

    # Install the car package
    install.packages("car")
    
    
    # Load the car package     
    library("car")
    
    replace_string=function(x) {  
    recode(x,'1="1 Very Dissatisfied";  
              2="2 Satisfied";  
              3="3 Satisfied";  
              4="4 Satisfied";   
              5="5 Satisfied";  
              6="6 Satisfied";  
              7="7 Satisfied";  
              8="8 Satisfied";  
              9="9 Satisfied";  
             10="10 Very Satisfied";   
             99="99 Dont Know"')  
     }  
    
     data=apply(data,1:2,replace_string)  
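
The same two steps can also be sketched in base R, without the car package. This is only an illustration under stated assumptions: the vector `raw` is made-up sample data, and the labels are taken from the target levels listed in the question.

```r
# Hypothetical sample that mimics the noisy values in the question
raw <- c("3", "1 Very dissatisfied", "10 Very Satisfied", "99 Don't Know", "7", NA)

# Target codes and labels, copied from the question's desired levels
codes  <- c(1:10, 99)
labels <- c("1 Very Dissatisfied", paste(2:9, "Satisfied"),
            "10 Very Satisfied", "99 Dont Know")

# Step 1: keep only the digits and convert them to integers
num <- as.integer(gsub("[^0-9]", "", raw))

# Step 2: map each numeric code to its label in an ordered factor
clean <- factor(num, levels = codes, labels = labels, ordered = TRUE)

print(clean)
```

Entries that contain no digits at all, such as a bare "Very Satisfied", become NA after the digit-only step and would need separate handling.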
    

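Answer 1: (score: 1)

I would suggest not using the SPSS value labels when reading the file (read.spss is in the foreign package), so that the SPSS attributes are preserved: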

temp <- read.spss(file, use.value.labels = FALSE)

Then I would use ifelse to correct the labels, along the lines of your for loop:

temp$c <- ifelse(as.numeric(temp$c) %in% 2:9, paste(temp$c, "Satisfied", sep=" "), temp$c)
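
To see what such an ifelse call produces, here is a small sketch on a made-up vector of numeric codes (`codes` is hypothetical, not data from the question). It uses 2:9 so that code 1 stays free for the "1 Very Dissatisfied" label from the question's target levels.

```r
# Hypothetical numeric codes, as they might look without value labels
codes <- c(1, 5, 10, 99, NA)

# Label only codes 2 to 9; everything else is passed through unchanged
labelled <- ifelse(codes %in% 2:9, paste(codes, "Satisfied"), codes)

print(labelled)
```

The result is a character vector, so codes 1, 10 and 99 still need their own labels before the column is turned into a factor.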

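Answer 2: (score: 0)

My mistake was in grep. I replaced grep(i, data) with an anchored pattern of the form ^i$. The unanchored pattern matched 10 as well as 1, and 99 as well as 9. I used ^i$ to match the value exactly, so that ^9$ matches only 9 and not 99.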
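
The difference between the two patterns can be checked on a small example. The vector `vals` is made up for illustration.

```r
# Hypothetical values similar to the noisy levels in the question
vals <- c("1", "10", "9", "99", "1 Very Dissatisfied")

# Unanchored: matches every string that contains a "1" anywhere
loose <- grep("1", vals, value = TRUE)

# Anchored: matches only strings that are exactly "1"
exact <- grep("^1$", vals, value = TRUE)

# Building the anchored pattern from a loop variable
i <- 9
exact9 <- grep(paste0("^", i, "$"), vals, value = TRUE)

print(loose)   # "1" "10" "1 Very Dissatisfied"
print(exact)   # "1"
print(exact9)  # "9"
```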

To remove the unused levels from the factor and treat it as an ordinal variable, I ended up using ordered(data).
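
As a rough sketch of what dropping unused levels looks like, the example below compares droplevels() with rebuilding the factor through ordered(). The factor `f` and the shortened level list `wanted` are made up for illustration.

```r
# Shortened, hypothetical list of wanted levels
wanted <- c("1 Very Dissatisfied", "2 Satisfied", "10 Very Satisfied")

# Hypothetical factor that still carries junk levels after cleaning
f <- factor(c("2 Satisfied", "10 Very Satisfied", "2 Satisfied"),
            levels = c("2", "2 Satisfied", "10", "10 Very Satisfied", wanted[1]))

levels(f)                 # junk levels "2" and "10" are still listed

# Option 1: drop every level that no value uses
dropped <- droplevels(f)
levels(dropped)           # "2 Satisfied" "10 Very Satisfied"

# Option 2: rebuild as an ordered factor restricted to the wanted levels
o <- ordered(f, levels = wanted)
levels(o)                 # all three wanted levels, in the given order
o[1] < o[2]               # TRUE: "2 Satisfied" ranks below "10 Very Satisfied"
```

The second option keeps every wanted level, even one that no value currently uses, which is convenient when several columns should share the same scale.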

I used the following complete code to fix it myself:

Step 1: Define the factor levels

x<-factor()
x<-ordered(x,levels=c("1 Very Dissatisfied","2 Satisfied","3 Satisfied","4 Satisfied","5 Satisfied","6 Satisfied","7 Satisfied","8 Satisfied","9 Satisfied","10 Very Satisfied","Dont Know"))

Step 2: Now loop over the data columns and rows.

I used the following code:

for(j in c(28,29,32)){
    data[,j]<-factor(data[,j])
    #add required levels so that when introduced later, does not introduce NA
    data[,j] <- factor(data[,j], levels = c(levels(data[,j]), levels(x)))
    #Now remove and correct noises
    data[grep("99",data[,j]),j] <- "Dont Know"
    data[grep("Don",data[,j]),j] <- "Dont Know"
    data[grep("Very [Ss]",data[,j]),j] <- "10 Very Satisfied"
    data[grep("10",data[,j]),j] <- "10 Very Satisfied"
    data[grep("Very [Dd]",data[,j]),j] <- "1 Very Dissatisfied"
    data[grep("^1$",data[,j]),j] <- "1 Very Dissatisfied"
    #Loop through remaining data and correct
    for(i in 2:9){
       data[grep(paste("^",i,"$",sep=""),data[,j]),j] <- paste(i,"Satisfied")
    }
    #to remove unused factors, ordered
    data[,j]<-ordered(data[,j],levels(x))
}