How to replace missing data in R with a median based on a condition

Date: 2017-05-08 03:05:18

Tags: r missing-data data-munging

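I have data from https://drive.google.com/file/d/0B9YMMvghK2ytSXI4RFo0clNLc28/view

It is basically a diamonds dataset with ~600,000 rows.

The price column has missing values. I want to replace each NA with the median price for that diamond's color.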

summary(BigDiamonds)
##        X1             carat           cut               color          
##  Min.   :     1   Min.   :0.200   Length:598024      Length:598024     
##  1st Qu.:149507   1st Qu.:0.500   Class :character   Class :character  
##  Median :299013   Median :0.900   Mode  :character   Mode  :character  
##  Mean   :299013   Mean   :1.071                                        
##  3rd Qu.:448518   3rd Qu.:1.500                                        
##  Max.   :598024   Max.   :9.250                                        
##                                                                        
##    clarity              table           depth           cert          
##  Length:598024      Min.   : 0.00   Min.   : 0.00   Length:598024     
##  Class :character   1st Qu.:56.00   1st Qu.:61.00   Class :character  
##  Mode  :character   Median :58.00   Median :62.10   Mode  :character  
##                     Mean   :57.63   Mean   :61.06                     
##                     3rd Qu.:59.00   3rd Qu.:62.70                     
##                     Max.   :75.90   Max.   :81.30                     
##                                                                       
##  measurements           price             x                y         
##  Length:598024      Min.   :  300   Min.   : 0.150   Min.   : 1.000  
##  Class :character   1st Qu.: 1220   1st Qu.: 4.740   1st Qu.: 4.970  
##  Mode  :character   Median : 3503   Median : 5.780   Median : 6.050  
##                     Mean   : 8753   Mean   : 5.991   Mean   : 6.199  
##                     3rd Qu.:11174   3rd Qu.: 6.970   3rd Qu.: 7.230  
##                     Max.   :99990   Max.   :13.890   Max.   :13.890  
##                     NA's   :713     NA's   :1815     NA's   :1852    
##        z         
##  Min.   : 0.040  
##  1st Qu.: 3.120  
##  Median : 3.860  
##  Mean   : 4.033  
##  3rd Qu.: 4.610  
##  Max.   :13.180  
##  NA's   :2544

table(BigDiamonds$color)
## 
##     D     E     F     G     H     I     J     K     L 
## 73630 93483 93573 96204 86619 70282 48709 25868  9656



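# Split the rows by whether price is missing
Diamonds2 <- BigDiamonds[is.na(BigDiamonds$price), ]   # price is NA
Diamonds3 <- BigDiamonds[!is.na(BigDiamonds$price), ]  # price is known
library(Hmisc)
# Median price per color among the rows with a known price
summarize(Diamonds3$price, Diamonds3$color, median)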
##   Diamonds3$color Diamonds3$price
## 1               D            2690
## 2               E            2342
## 3               F            2966
## 4               G            3720
## 5               H            4535
## 6               I            4717
## 7               J            4697
## 8               K            4418
## 9               L            3017

I tried this, but it doesn't work:

Diamonds21=select(Diamonds2,price,color,cut)

Diamonds21$newprice=ifelse(Diamonds21$color=="J",4697,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="D",2690,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="E",2342,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="F",2966,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="G",3720,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="H",4535,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="I",4717,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="K",4418,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="L",3017,Diamonds21$newprice) 

Is my logic wrong?

2 Answers:

Answer 0: (score: 1)

Depending on what best suits your needs, there are a few different ways to approach this.

First, let's set up a diamonds dataset with missing price values:

library(dplyr)
data(diamonds, package = "ggplot2")

# Randomly set roughly 20% of the prices to NA
diamonds_missing <- diamonds %>%
    mutate(price = ifelse(sample(1:0, size = length(diamonds$price), replace = TRUE, prob = c(0.8, 0.2)),
                          price, NA))

Now roughly 20% of the price values in the diamonds dataset are missing.

You can use mutate() and ifelse() to replace them with the median:

diamonds_missing %>%
    mutate(price = ifelse(is.na(price), median(price, na.rm = TRUE), price))

Or, if you prefer, you can use the replace_na() function from the tidyr package instead of ifelse().

Either way, the missing prices are replaced by the median:

#> # A tibble: 53,940 × 10
#>    carat       cut color clarity depth table price     x     y     z
#>    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1   0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
#> 2   0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
#> 3   0.23      Good     E     VS1  56.9    65  2396  4.05  4.07  2.31
#> 4   0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
#> 5   0.31      Good     J     SI2  63.3    58  2396  4.34  4.35  2.75
#> 6   0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
#> 7   0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
#> 8   0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
#> 9   0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
#> # ... with 53,930 more rows
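
The code above fills every gap with the overall median, whereas the question asks for the median within each color. A minimal sketch of the per-color version, assuming dplyr is installed; the data frame `toy` and its prices are made up for illustration and are not taken from BigDiamonds:

```r
library(dplyr)

# Hypothetical toy data, not the real BigDiamonds values
toy <- data.frame(
  color = c("D", "D", "D", "E", "E", "E"),
  price = c(100, 300, NA, 10, NA, 30)
)

# Within each color, replace NA prices with that color's median
filled <- toy %>%
  group_by(color) %>%
  mutate(price = ifelse(is.na(price), median(price, na.rm = TRUE), price)) %>%
  ungroup()

filled$price
```

In this toy data the missing D price becomes 200 and the missing E price becomes 20, because median() is evaluated separately for each group. The same pipeline should carry over to diamonds_missing or BigDiamonds by swapping in that data frame for `toy`.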

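Answer 1: (score: 0)

I tried this and it works.

Note that the first line is different: it falls back to Diamonds21$price, because the newprice column does not exist yet at that point.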

Diamonds21$newprice=ifelse(Diamonds21$color=="J",4697,Diamonds21$price)
Diamonds21$newprice<-ifelse(Diamonds21$color=="D",2690,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="E",2342,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="F",2966,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="G",3720,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="H",4535,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="I",4717,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="K",4418,Diamonds21$newprice)
Diamonds21$newprice<-ifelse(Diamonds21$color=="L",3017,Diamonds21$newprice)
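
The nine ifelse() calls above hard-code medians that were copied by hand from the summary table. A sketch of the same idea in base R, with the medians computed once and looked up by color name; `toy`, `medians` and the prices here are hypothetical stand-ins for the real data:

```r
# Hypothetical toy data standing in for BigDiamonds
toy <- data.frame(
  color = c("J", "J", "J", "D", "D", "D"),
  price = c(4000, 5000, NA, 2000, NA, 3000),
  stringsAsFactors = FALSE
)

# Median price per color (named by color), ignoring NAs
medians <- tapply(toy$price, toy$color, median, na.rm = TRUE)

# Keep known prices; otherwise look up the median by color name
toy$newprice <- ifelse(is.na(toy$price), medians[toy$color], toy$price)

toy$newprice
```

Here the missing J price is filled with 4500 and the missing D price with 2500. Because the lookup is driven by the color column, no line needs to change if a color is added or a median shifts.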