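I want to scrape multiple pages from the Migration Policy Institute, each of which holds an immigration profile for a selected US county. I then want to collect this data in a CSV in long format for later processing. What is the simplest/most efficient way to do this?
Each page has a table with 3 columns: descriptive text, estimate, and percent of group. The rows are separated by subheadings. Counties are identified by a "geoid" appended to the end of the URL.
I am new to R and have been searching online forums, but I am now more confused than when I started. My code is basic and I don't know where to go from here. Any advice is much appreciated!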
library(rvest)
library(dplyr)
url_base <- "https://www.migrationpolicy.org/data/unauthorized-immigrant-population/county/%d"
GEOID <- c(4013, 4019, 4027, 4021, 5007, 5143, 6037, 6059, 6073, 6065, 6071, 6085, 6001,
6019, 6029, 6069, 6013, 6111, 6067, 6081, 6077, 6107, 6075, 6083, 6099, 6097, 6047,
6095, 6087, 6025, 6039, 6041, 6031, 6113, 6055, 6079, 8001, 8005, 8035, 8059, 8013,
8014, 8047, 8019, 8039, 8031, 8041, 9001, 9009, 9003, 10003, 12086, 12011, 12099,
12095, 12057, 12071, 12021, 12105, 12103, 12031, 12089, 12081, 12097, 13135,
13089, 13067, 13121, 13139, 13063, 15000, 17031, 17089, 17043, 17097, 17197,
18097, 19153, 20091, 20015, 20173, 20209, 21111, 21067, 22051, 22075, 22087,
24031, 24033, 24005, 24510, 24003, 24027, 25025, 25027, 26163, 26125, 26081,
27053, 27123, 29095, 29037, 29189, 31055, 32003, 32031, 34017, 34023, 34013,
34003, 34017, 34031, 34021, 34025, 34027, 34035, 34007, 34029, 34001, 35001,
35013, 36081, 36047, 36005, 36061, 36119, 36103, 36059, 36085, 36087, 36071,
37119, 37183, 37081, 37063, 37067, 37135, 39049, 39061, 40109, 40143, 41147,
41051, 41047, 42101, 42091, 42029, 44007, 45045, 45013, 45053, 45051, 47037,
47157, 47093, 47001, 47173, 48201, 48113, 48215, 48439, 48453, 48029, 48141,
48061, 48479, 48339, 48085, 48157, 48121, 48491, 48039, 48453, 48493, 49035,
49049, 51059, 51153, 51107, 51013, 51510, 51087, 53033, 53077, 53061, 53053, 55079, 55025)
UAprofiles <- for (i in GEOID) {
  cat("----------")
  pg <- read_html(sprintf(url_base, i))
  table_i <- html_node(html_table(pg, "table#unauthorized-table.datasheet"))
}
str(UAprofiles)
# I get this error message: "Error in if (header) { : argument is not interpretable as logical"
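For reference, the error seems to come from `html_table()` receiving the selector string as its second argument (`header`), which it expects to be logical. Separately, a `for` loop in R returns `NULL`, so `UAprofiles` would stay empty even without the error. Below is a minimal sketch of one way around both, assuming the selector above really matches the table on the live page. The helper names `read_profile` and `scrape_county` are made up for illustration, and the demo table holds placeholder values rather than MPI data.

```r
library(rvest)
library(dplyr)

url_base <- "https://www.migrationpolicy.org/data/unauthorized-immigrant-population/county/%d"

# Hypothetical helper: pull the profile table out of one parsed page.
# ASSUMPTION: the selector from the question matches the table on the live page.
read_profile <- function(page, geoid) {
  node <- html_node(page, "table#unauthorized-table.datasheet")  # select the node first
  tbl <- html_table(node)                                        # then parse it
  tbl[] <- lapply(tbl, as.character)  # keep column types consistent across counties
  tbl$geoid <- geoid                  # remember which county the rows belong to
  tbl
}

# Hypothetical helper: download one county page and parse it.
scrape_county <- function(geoid) {
  read_profile(read_html(sprintf(url_base, geoid)), geoid)
}

# Offline check on a made-up table (placeholder values, not MPI data)
demo_page <- read_html('<table id="unauthorized-table" class="datasheet">
  <tr><th>Description</th><th>Estimate</th><th>Percent</th></tr>
  <tr><td>Example row</td><td>1,000</td><td>50%</td></tr>
</table>')
demo <- read_profile(demo_page, 4013)

# Real run (needs network access), left commented out:
# UAprofiles <- bind_rows(lapply(unique(GEOID), scrape_county))
# write.csv(UAprofiles, "UAprofiles.csv", row.names = FALSE)
```

`lapply()` collects one data frame per county and `bind_rows()` stacks them, which already gives a long layout with one row per county and table line. `unique()` is there because a couple of geoids appear twice in the vector.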
In case it helps, here is the list of selectors for each heading and subheading:
#unauthorized-table.datasheet-heading
#topcountries.datasheet-subheading
#regionsbirth.datasheet-subheading
#yearsresidence.datasheet-subheading
#age.datasheet-subheading
#gender.datasheet-subheading
#family.datasheet-heading
#parental.datasheet-subheading
#marital.datasheet-subheading
#education.datasheet-heading
#schoolenrollment.datasheet-subheading
#adulteducation.datasheet-subheading
#english.datasheet-subheading
#languages.datasheet-subheading
#workforce.datasheet-heading
#laborforce.datasheet-subheading
#industries.datasheet-subheading
#economics.datasheet-heading
#income.datasheet-subheading
#healthinsurance.datasheet-subheading
#homeownership.datasheet-subheading
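For the long format, each data row also needs to know which heading it sits under. Here is a rough sketch of one way to carry the headings down in base R. It rests on a guess I have not verified against the site: that after parsing, heading and subheading rows are the ones with an empty estimate cell. The function name `tag_sections`, the column names, and the demo values are all made up for illustration.

```r
# Hypothetical helper: label each data row with the most recent heading row.
# ASSUMPTION: heading rows are the ones whose estimate cell is empty or NA.
tag_sections <- function(tbl) {
  names(tbl)[1:3] <- c("description", "estimate", "pct_of_group")
  is_heading <- is.na(tbl$estimate) | trimws(tbl$estimate) == ""
  # cumsum() gives a heading row and every row below it the same group number
  group <- cumsum(is_heading)
  headings <- tbl$description[is_heading]
  # rows that come before any heading get NA as their section
  tbl$section <- ifelse(group == 0, NA_character_, headings[pmax(group, 1)])
  # drop the heading rows themselves and put the section label first
  tbl[!is_heading, c("section", "description", "estimate", "pct_of_group")]
}

# Made-up rows shaped like a parsed page (placeholder values, not MPI data)
demo <- data.frame(
  X1 = c("Top Countries of Birth", "Mexico", "Guatemala", "Age", "Under 16"),
  X2 = c("", "1,000", "500", "", "200"),
  X3 = c("", "50%", "25%", "", "10%"),
  stringsAsFactors = FALSE
)
long <- tag_sections(demo)
print(long)
```

If the heading rows turn out to be marked differently on the real page, for example through the ids listed above, only the `is_heading` line should need to change.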