I'm new to the R language, and I have a task: I need to display a box plot of the data from an HTML table on Wikipedia:
library("rvest")
library("ggplot2")
library("dplyr")
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_oil_exports"
Countries <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
html_table(header=TRUE, fill=TRUE)
Countries <- Countries
head(Countries)
str(Countries)
for(i in 1:74){
Countries[i,3] = as.numeric(Countries[i,3])
}
#ggplot(Oil_Exports) + geom_boxplot() +
# ylab("Amount of oil Exports in (bbl/day)") +
# opts(title = "List of countries by oil exports")
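If I'm going about this the right way, I'm currently trying to convert the values in every row of column 3 (Oil - exports (bbl/day)) to numeric. I get the following output and error: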
List of 1
$ :'data.frame': 74 obs. of 6 variables:
..$ Rank : int [1:74] 1 2 3 4 5 6 7 8 9 10 ...
..$ Country/Region : chr [1:74] "Saudi Arabia" "Russia" "Kuwait" "Iran" ...
..$ Oil - exports (bbl/day): chr [1:74] "6,880,000" "4,720,000" "2,750,000" "2,445,000" ...
..$ Date of
information : chr [1:74] "2011 est." "2013 est." "2016 est." "2011 est." ...
..$ Oil - exports (bbl/day): chr [1:74] "8,865,000" "7,201,000" "2,300,000" "1,808,000" ...
..$ Date of
information : int [1:74] 2012 2012 2012 2012 2016 2014 2012 2012 2012 2012 ...
Error in Countries[i, 3]: incorrect number of dimensions
Traceback:
How can I fix this, and is there a better way to do it? Thanks.
Answer 0: (score: 2)
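The output of your scraping script is a list, not a data.frame, which is why Countries[i, 3] fails with "incorrect number of dimensions". I think you just want to extract the data.frame that is the first element of this list. So simply add Countries <- Countries[[1]]: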
library("rvest")
library("ggplot2")
library("dplyr")
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_oil_exports"
Countries <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
html_table(header=TRUE, fill=TRUE)
Countries <- Countries[[1]]
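However, because your variable contains commas as thousands separators, the conversion to numeric won't work out of the box. Let's remove them: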
Countries[,3] <- gsub(",", "", Countries[,3])
Also, you don't need a loop to convert the variable, because as.numeric is vectorized:
Countries[,3] <- as.numeric(Countries[,3])
Countries[,3]
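To tie the steps together and get to the box plot the question asks for, here is a minimal sketch that runs without network access. The `scraped` object is a hypothetical stand-in for the result of `html_table()`, built from the first four rows shown in the `str()` output above, and the short column names `Country` and `Exports` are assumptions chosen for illustration (the real table uses longer, duplicated names). It uses `[[3]]` rather than `[,3]` so that the column comes back as a plain vector whether the table is a data.frame or a tibble.

```r
library(ggplot2)

# Hypothetical stand-in for the scraped result: a list holding one data frame,
# mirroring the "List of 1" structure shown in the str() output.
scraped <- list(data.frame(
  Rank = 1:4,
  Country = c("Saudi Arabia", "Russia", "Kuwait", "Iran"),
  Exports = c("6,880,000", "4,720,000", "2,750,000", "2,445,000"),
  stringsAsFactors = FALSE
))

# Extract the data frame from the list.
Countries <- scraped[[1]]

# Strip the thousands separators, then convert the whole column at once.
Countries[[3]] <- as.numeric(gsub(",", "", Countries[[3]], fixed = TRUE))

str(Countries)

# A single box summarising the distribution of export volumes.
p <- ggplot(Countries, aes(x = "", y = Exports)) +
  geom_boxplot() +
  labs(title = "List of countries by oil exports",
       x = NULL, y = "Oil exports (bbl/day)")
print(p)
```

The title is set with `labs()` because `opts()`, used in the commented-out attempt in the question, is no longer available in current ggplot2 releases. With the real scraped table, the same `aes(y = ...)` mapping would need the actual column name wrapped in backticks, or the column renamed first.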