How do I replace a variable's missing values with the variable's median using gsub in R?

Asked: 2019-04-10 01:18:52

Tags: r regex gsub

I have a data frame that was extracted from the HTML of a table on a Wikipedia page. I want to replace the missing values with the median of each variable.

From the hints I was given, I know I need to convert the factor columns to numeric values, and that I may need to use as.numeric(gsub()):

renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))

I tried grep() with '\\s' as the pattern to pull out the blanks, but the output actually excluded the blanks and showed only the numbers.

When I tried as.numeric(gsub()), the output was:

[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01

That is nothing like the data frame, which looks like this:

[1] 1035.3   7782     72       7109                       30134.8  2351.2            15318   

I want the output to look exactly like the original data frame, but with the blanks filled in with the median.

Edit: This is the beginning of the data frame. It comes from https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources.

> renew
                             Country    Hydro     Wind     Bio   Solar
1                        Afghanistan   1035.3      0.1            35.5
2                            Albania     7782                      1.9
3                            Algeria       72     19.4           339.1
4                             Angola     7109              155    18.3
5                           Anguilla                               2.4
6                Antigua and Barbuda                               5.5
7                          Argentina  30134.8    554.1  1820.4    14.5
8                            Armenia   2351.2      1.8             1.2
9                              Aruba             130.3     8.9     9.2
10                         Australia    15318    12199    3722    6209
11                           Austria    42919     5235    4603    1096
12                        Azerbaijan   1959.3     22.8   174.5    35.3
13                           Bahamas                               1.9
14                           Bahrain               1.2             8.3
15                        Bangladesh      946      5.1     7.7   224.3

3 Answers:

Answer 0 (score: 1)

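Because the data frame contains blanks, the columns are read as character, and taking the median of a character column makes no sense. We can first replace the blanks with NA, convert the columns to numeric, and then replace each NA with the median of its column. With dplyr, the steps are: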

library(dplyr)
renew[renew == ""] <- NA

renew %>%
   mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
   mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))


#             Country   Hydro    Wind    Bio  Solar
#1        Afghanistan  1035.3     0.1  174.5   35.5
#2            Albania  7782.0    21.1  174.5    1.9
#3            Algeria    72.0    19.4  174.5  339.1
#4             Angola  7109.0    21.1  155.0   18.3
#5           Anguilla  4730.1    21.1  174.5    2.4
#6  AntiguaandBarbuda  4730.1    21.1  174.5    5.5
#7          Argentina 30134.8   554.1 1820.4   14.5
#8            Armenia  2351.2     1.8  174.5    1.2
#9              Aruba  4730.1   130.3    8.9    9.2
#10         Australia 15318.0 12199.0 3722.0 6209.0
#11           Austria 42919.0  5235.0 4603.0 1096.0
#12        Azerbaijan  1959.3    22.8  174.5   35.3
#13           Bahamas  4730.1    21.1  174.5    1.9
#14           Bahrain  4730.1     1.2  174.5    8.3
#15        Bangladesh   946.0     5.1    7.7  224.3

We can also do this in base R:

renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x) {
  # as.character() first, so factor columns convert by label rather than by level code
  x <- as.numeric(as.character(x))
  replace(x, is.na(x), median(x, na.rm = TRUE))
})
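
As a side note on why the as.numeric(gsub()) attempt in the question returned values like 5.4e+13, here is a minimal sketch on a toy factor. The hydro vector is made up from a few values in the question, and the whitespace-tolerant pattern is only my assumption about how blank cells might look after scraping. Two things go wrong at once: as.numeric() on a factor returns level codes, and an empty pattern matches between the characters of every value, so the replacement is spliced into the numbers rather than into the blanks.

```r
# Toy stand-in for one scraped column (a few values from the question)
hydro <- factor(c("1035.3", "7782", "72", ""))

# On a factor, as.numeric() returns the integer level codes, not the labels,
# so a median taken from these codes is not the median of the data
codes <- as.numeric(hydro)

# An empty pattern matches inside every string, so every value is altered
mangled <- gsub("", "54", as.character(hydro))

# Instead: go through the labels, flag cells that are empty or whitespace only,
# convert to numeric, then fill with the median of the observed values
labels <- as.character(hydro)
labels[grepl("^[[:space:]]*$", labels)] <- NA
values <- as.numeric(labels)
values[is.na(values)] <- median(values, na.rm = TRUE)
values
```

Here no gsub() is needed for the blanks at all; indexing with is.na() does the replacement.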

Answer 1 (score: 0)

We can use na.aggregate from zoo to do this in a compact way:

library(dplyr)
library(hablar)
library(zoo)
renew %>%
    retype %>% # change the type of columns
    # replace missing value of numeric columns with median
     mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
#   Country              Hydro    Wind    Bio  Solar
#   <chr>                <dbl>   <dbl>  <dbl>  <dbl>
# 1 Afghanistan          1035.     0.1  174.    35.5
# 2 Albania              7782     21.1  174.     1.9
# 3 Algeria                72     19.4  174.   339. 
# 4 Angola               7109     21.1  155     18.3
# 5 Anguilla             4730.    21.1  174.     2.4
# 6 Antigua and Barbuda  4730.    21.1  174.     5.5
# 7 Argentina           30135.   554.  1820.    14.5
# 8 Armenia              2351.     1.8  174.     1.2
# 9 Aruba                4730.   130.     8.9    9.2
#10 Australia           15318  12199   3722   6209  
#11 Austria             42919   5235   4603   1096  
#12 Azerbaijan           1959.    22.8  174.    35.3
#13 Bahamas              4730.    21.1  174.     1.9
#14 Bahrain              4730.     1.2  174.     8.3
#15 Bangladesh            946      5.1    7.7  224. 
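
For readers who prefer to avoid extra packages, the fill step can be approximated in base R. This is a minimal sketch, not zoo's implementation: fill_median and the toy data frame df are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not part of zoo): replace NAs with the median
# of the non-missing values in the same vector
fill_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data frame with one character and one numeric column
df <- data.frame(Country = c("A", "B", "C"),
                 Wind = c(1, NA, 3),
                 stringsAsFactors = FALSE)

# Apply the helper to numeric columns only, leaving Country untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], fill_median)
df
```

The vapply() call plays the role of the is.numeric predicate in mutate_if above.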

Data

renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria", 
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia", 
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", 
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "", 
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "", 
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1", 
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("", 
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5", 
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5, 
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"), class = "data.frame")

Answer 2 (score: 0)

I want to point out that the data is not yet clean after scraping, since lapply(renew, function(x) grep(",", x)) returns matches.

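Clean it with gsub first, so that those values are not turned into NA when the data is converted to numeric. Here is a one-step solution that also creates the proper NAs automatically: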

renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))

After that, you can run

# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))

Or, of course, a shorter adaptation of the second line of @Ronak Shah's base R code, which is much better:

renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))

Result

summary(renew)
#                      country        hydro                wind                bio              solar        
# Afghanistan        :  1   Min.   :      0.8   Min.   :     0.00   Min.   :    0.2   Min.   :    0.1  
# Albania            :  1   1st Qu.:    907.8   1st Qu.:    50.45   1st Qu.:  151.1   1st Qu.:    4.8  
# Algeria            :  1   Median :   2595.0   Median :   109.00   Median :  242.5   Median :   22.3  
# Angola             :  1   Mean   :  19989.3   Mean   :  4324.13   Mean   : 2136.3   Mean   : 1483.3  
# Anguilla           :  1   3rd Qu.:   7992.4   3rd Qu.:   293.55   3rd Qu.:  344.4   3rd Qu.:  124.5  
# Antigua and Barbuda:  1   Max.   :1193370.0   Max.   :242387.70   Max.   :69017.0   Max.   :67874.1  
# (Other)            :209                                                                              
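
To make the cleaning step concrete, here is a minimal sketch on a toy vector. The values in raw are invented for illustration, and treating "," as a decimal mark simply mirrors the gsub() call in this answer; if the commas in the scraped table are thousands separators instead, they would need to be removed rather than replaced.

```r
# Toy values; the comma-decimal entry is an assumption for illustration
raw <- c("12,5", "7782", "", "3.5")

# Without cleaning, "12,5" cannot be parsed and becomes NA (with a warning)
uncleaned <- suppressWarnings(as.numeric(raw))

# Replace "," with "." before converting; the empty string still becomes NA
clean <- as.numeric(gsub(",", ".", raw, fixed = TRUE))

# Variant for thousands separators: drop the commas instead
# as.numeric(gsub(",", "", "1,035.3", fixed = TRUE))

# Fill the remaining NA with the median of the observed values
clean[is.na(clean)] <- median(clean, na.rm = TRUE)
clean
```

Comparing uncleaned with clean shows why the order matters: values lost to NA before the median is computed would shift the median itself.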

Data

library(rvest)
renew <- setNames(html_table(
  read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
                   "_by_electricity_production_from_renewable_sources")),
  fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)