I am trying to scrape the data in Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls
As suggested, I used SelectorGadget to find the relevant CSS selector, and the one that contains all of the data (plus some extraneous information) is "#page_content".
I have tried the following code, which produces an error:
library(rvest)

fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
fbi %>%
  html_node("#page_content") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE
#Try extracting only the first column:
fbi %>%
  html_nodes(".group0") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE
#Directly feed fbi into html_table:
data <- fbi %>% html_table(fill = TRUE)
#This produces a list of 3 elements; elements 1 and 3 contain many missing values.
Any help is much appreciated!
Answer 0 (score: 1)
You can download the Excel file directly. After that, take a look at the file, convert the data you need into a csv file, and then process it. Here is the code to download the file:
library(rvest)
library(stringr)
page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
pageAdd <- page %>%
  html_nodes("a") %>%        # find all links on the page
  html_attr("href") %>%      # get the url of each link
  str_subset("\\.xls") %>%   # keep only those that end in .xls
  .[[1]]                     # take the first match
mydestfile <- "D:/Kumar/table5.xls" # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode="wb")
The data is not in a very clean format, so wrangling it entirely within R would be messier. To me, this seemed the best way to solve the problem.
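If you would rather skip the manual csv step, below is a minimal sketch of reading the downloaded workbook directly into R. It assumes the readxl package is installed, and the number of header rows to skip is only a guess; check it against the actual file.

library(readxl)

# Read the downloaded workbook; skip = 3 is a guess at the number of
# title/header rows in the FBI table and may need adjusting.
table5 <- read_excel(mydestfile, skip = 3)
head(table5)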