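I have a data frame that was extracted from the HTML of a table on a Wikipedia page. I want to replace the missing values with the median of each variable.
From the hints I was given, I know I need to convert the factor type to numeric values, and that I probably need to use as.numeric(gsub()).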
renew$Hydro[grep('\\s', renew$Hydro)]
as.numeric(gsub('', median(as.numeric(renew$Hydro)), renew$Hydro))
lapply(renew, function(x) as.numeric(gsub('', median(as.numeric(x)), x)))
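I tried grep() with '\\s' as the pattern to pick out the blanks, but the output actually excluded the blanks and showed only the numbers.
When I tried as.numeric(gsub()), the output was: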
[1] 5.415405e+13 5.475475e+13 5.475425e+07 5.475415e+13 5.400000e+01 5.400000e+01 5.435405e+16
[8] 5.425435e+13 5.400000e+01 5.415455e+16 5.445425e+16 5.415495e+13 5.400000e+01 5.400000e+01
which is nothing like the data frame, which looks like this:
[1] 1035.3 7782 72 7109 30134.8 2351.2 15318
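I want the output to look exactly like the original data frame, but with the blanks filled in with the median.
Edit: this is the beginning of the data frame. Source: https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production_from_renewable_sources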
> renew
               Country   Hydro   Wind    Bio  Solar
1          Afghanistan  1035.3    0.1          35.5
2              Albania    7782                  1.9
3              Algeria      72   19.4         339.1
4               Angola    7109           155   18.3
5             Anguilla                          2.4
6  Antigua and Barbuda                          5.5
7            Argentina 30134.8  554.1 1820.4   14.5
8              Armenia  2351.2    1.8           1.2
9                Aruba          130.3    8.9    9.2
10           Australia   15318  12199   3722   6209
11             Austria   42919   5235   4603   1096
12          Azerbaijan  1959.3   22.8  174.5   35.3
13             Bahamas                          1.9
14             Bahrain            1.2           8.3
15          Bangladesh     946    5.1    7.7  224.3
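Answer 0 (score: 1)
Because the data frame contains blanks, the columns are read as character, and taking the median of a character column is meaningless. We can first replace the blanks with NA, convert the columns to numeric, and then replace each NA with the median of its column. With dplyr, the steps are as follows.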
library(dplyr)
renew[renew == ""] <- NA
renew %>%
mutate_at(-1, as.numeric) %>% #-1 is to ignore Country column
mutate_at(-1, ~ replace(., is.na(.), median(., na.rm = TRUE)))
# Country Hydro Wind Bio Solar
#1 Afghanistan 1035.3 0.1 174.5 35.5
#2 Albania 7782.0 21.1 174.5 1.9
#3 Algeria 72.0 19.4 174.5 339.1
#4 Angola 7109.0 21.1 155.0 18.3
#5 Anguilla 4730.1 21.1 174.5 2.4
#6 AntiguaandBarbuda 4730.1 21.1 174.5 5.5
#7 Argentina 30134.8 554.1 1820.4 14.5
#8 Armenia 2351.2 1.8 174.5 1.2
#9 Aruba 4730.1 130.3 8.9 9.2
#10 Australia 15318.0 12199.0 3722.0 6209.0
#11 Austria 42919.0 5235.0 4603.0 1096.0
#12 Azerbaijan 1959.3 22.8 174.5 35.3
#13 Bahamas 4730.1 21.1 174.5 1.9
#14 Bahrain 4730.1 1.2 174.5 8.3
#15 Bangladesh 946.0 5.1 7.7 224.3
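In recent dplyr releases mutate_at() is superseded by across(). A minimal sketch of the same two steps in that style, assuming dplyr 1.0.0 or later; renew_small and its values are a made-up stand-in, not the scraped table:

```r
library(dplyr)

# Illustrative stand-in for the scraped table (hypothetical subset)
renew_small <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Angola"),
  Hydro   = c("1035.3", "7782", "72", ""),
  Wind    = c("0.1", "", "19.4", ""),
  stringsAsFactors = FALSE
)

filled <- renew_small %>%
  # turn "" into NA, then convert every column except Country to numeric
  mutate(across(-Country, ~ as.numeric(na_if(.x, "")))) %>%
  # replace each NA with the median of the observed values in that column
  mutate(across(-Country, ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))

filled
```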
We can also do this in base R:
renew[renew == ""] <- NA
renew[-1] <- lapply(renew[-1], function(x)
as.numeric(replace(x, is.na(x), median(as.numeric(x), na.rm = TRUE))))
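If the columns were read as factors, as the question mentions, as.numeric() alone would return the integer level codes rather than the printed numbers, so going through as.character() first is the safer route. A minimal sketch under that assumption; fill_median is a hypothetical helper name and the vector is made up:

```r
# Hypothetical helper: convert one column to numeric and fill gaps with its median
fill_median <- function(x) {
  x <- trimws(as.character(x))        # factor labels, not level codes; strip stray spaces
  x[!is.na(x) & x == ""] <- NA        # blank cells become missing values
  x <- as.numeric(x)
  replace(x, is.na(x), median(x, na.rm = TRUE))
}

hydro <- factor(c("1035.3", "7782", "", "72"))
hydro_filled <- fill_median(hydro)
hydro_filled
```

It could then be applied to every column except the first with renew[-1] <- lapply(renew[-1], fill_median).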
Answer 1 (score: 0)
We can use na.aggregate from zoo to do this in a compact way.
library(dplyr)
library(hablar)
library(zoo)
renew %>%
retype %>% # change the type of columns
# replace missing value of numeric columns with median
mutate_if(is.numeric, na.aggregate, FUN = median)
# A tibble: 15 x 5
# Country Hydro Wind Bio Solar
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1035. 0.1 174. 35.5
# 2 Albania 7782 21.1 174. 1.9
# 3 Algeria 72 19.4 174. 339.
# 4 Angola 7109 21.1 155 18.3
# 5 Anguilla 4730. 21.1 174. 2.4
# 6 Antigua and Barbuda 4730. 21.1 174. 5.5
# 7 Argentina 30135. 554. 1820. 14.5
# 8 Armenia 2351. 1.8 174. 1.2
# 9 Aruba 4730. 130. 8.9 9.2
#10 Australia 15318 12199 3722 6209
#11 Austria 42919 5235 4603 1096
#12 Azerbaijan 1959. 22.8 174. 35.3
#13 Bahamas 4730. 21.1 174. 1.9
#14 Bahrain 4730. 1.2 174. 8.3
#15 Bangladesh 946 5.1 7.7 224.
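To see what na.aggregate does on its own, here is a small sketch on a plain numeric vector (the values are made up, and it assumes the zoo package is installed). With FUN = median, each NA is replaced by the median of the non-missing values:

```r
library(zoo)

x <- c(1, NA, 3, 10)

# Fill the missing entry with the median of the observed values
x_filled <- na.aggregate(x, FUN = median)
x_filled
```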
renew <- structure(list(Country = c("Afghanistan", "Albania", "Algeria",
"Angola", "Anguilla", "Antigua and Barbuda", "Argentina", "Armenia",
"Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain",
"Bangladesh"), Hydro = c("1035.3", "7782", "72", "7109", "",
"", "30134.8", "2351.2", "", "15318", "42919", "1959.3", "",
"", "946"), Wind = c("0.1", "", "19.4", "", "", "", "554.1",
"1.8", "130.3", "12199", "5235", "22.8", "", "1.2", "5.1"), Bio = c("",
"", "", "155", "", "", "1820.4", "", "8.9", "3722", "4603", "174.5",
"", "", "7.7"), Solar = c(35.5, 1.9, 339.1, 18.3, 2.4, 5.5, 14.5,
1.2, 9.2, 6209, 1096, 35.3, 1.9, 8.3, 224.3)), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
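Answer 2 (score: 0)
I would like to point out that lapply(renew, function(x) grep(",", x)) returns some matches, so the data is not yet clean after scraping.
Clean it with gsub first, so that those values are not turned into NA when the data is converted to numeric. Here is a one-step solution that also creates the correct NAs automatically: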
renew[-1] <- lapply(renew[-1], function(x) as.numeric(as.character(gsub(",", ".", x))))
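Whether a comma should become a decimal point or simply be dropped depends on how the source table writes its numbers, which is worth checking before converting. A small sketch of both readings on made-up strings (not values taken from the page):

```r
# Reading 1: the comma is a decimal mark, as in the line above
raw_decimal <- c("1,5", "2351.2", "")
as_decimal <- as.numeric(gsub(",", ".", raw_decimal))

# Reading 2: the comma is a thousands separator, so it is removed instead
raw_thousands <- c("1,193", "72", "")
as_thousands <- as.numeric(gsub(",", "", raw_thousands))

# In both cases the empty string converts to NA
as_decimal
as_thousands
```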
After that, you can run
# sapply(2:5, function(x) renew[[x]][is.na(renew[[x]])] <<- median(renew[[x]], na.rm=TRUE))
Or, of course, a shorter adaptation of the second line of @Ronak Shah's base R code, which is much better:
renew[-1] <- sapply(renew[-1], function(x) replace(x, is.na(x), median(x, na.rm=TRUE)))
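Here sapply simplifies its result to a matrix before it is assigned back to the columns. A variant with lapply keeps the result as a list of columns, which some may find more predictable. A minimal sketch on a made-up data frame (df and its values are illustrative only):

```r
df <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, NA, 3),
  y  = c(NA, 4, 8),
  stringsAsFactors = FALSE
)

# Replace each NA with the median of the observed values in its column
df[-1] <- lapply(df[-1], function(col) {
  replace(col, is.na(col), median(col, na.rm = TRUE))
})

# Count of remaining missing values per column
colSums(is.na(df[-1]))
```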
Result
summary(renew)
# country hydro wind bio solar
# Afghanistan : 1 Min. : 0.8 Min. : 0.00 Min. : 0.2 Min. : 0.1
# Albania : 1 1st Qu.: 907.8 1st Qu.: 50.45 1st Qu.: 151.1 1st Qu.: 4.8
# Algeria : 1 Median : 2595.0 Median : 109.00 Median : 242.5 Median : 22.3
# Angola : 1 Mean : 19989.3 Mean : 4324.13 Mean : 2136.3 Mean : 1483.3
# Anguilla : 1 3rd Qu.: 7992.4 3rd Qu.: 293.55 3rd Qu.: 344.4 3rd Qu.: 124.5
# Antigua and Barbuda: 1 Max. :1193370.0 Max. :242387.70 Max. :69017.0 Max. :67874.1
# (Other) :209
Data
library(rvest)
renew <- setNames(html_table(
read_html(paste0("https://en.wikipedia.org/wiki/List_of_countries",
"_by_electricity_production_from_renewable_sources")),
fill=TRUE, header=TRUE)[[1]][c(1, 6:9)], c("country", "hydro", "wind", "bio", "solar"))
renew$country <- factor(renew$country)