Scraping HTML tables from multiple pages with rvest

Date: 2018-07-04 22:16:29

Tags: r html-table rvest scrape

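I want to scrape multiple pages from the Migration Policy Institute, which publishes immigrant profiles for selected U.S. counties. I then want to collect this data in long format in a CSV for later processing. What is the simplest/most efficient way to do this?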

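Each page has a table with 3 columns: descriptive text, an estimate, and percent of group. The rows are separated by subheadings. Counties are identified by a "geoid" appended to the end of the URL.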

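I'm new to R and have been searching online forums, but I'm now more confused than when I started. My code is basic, and I don't know where to go from here. Any advice is much appreciated!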

library(rvest)
library(dplyr)

url_base <- "https://www.migrationpolicy.org/data/unauthorized-immigrant-population/county/%d"
GEOID <- c(4013,    4019,   4027,   4021,   5007,   5143,   6037,   6059,   6073,   6065,   6071,   6085,   6001,
           6019,    6029,   6069,   6013,   6111,   6067,   6081,   6077,   6107,   6075,   6083,   6099,   6097,   6047,
           6095,    6087,   6025,   6039,   6041,   6031,   6113,   6055,   6079,   8001,   8005,   8035,   8059,   8013,
           8014,    8047,   8019,   8039,   8031,   8041,   9001,   9009,   9003,   10003,  12086,  12011,  12099,
           12095,   12057,  12071,  12021,  12105,  12103,  12031,  12089,  12081,  12097,  13135,
           13089,   13067,  13121,  13139,  13063,  15000,  17031,  17089,  17043,  17097,  17197,
           18097,   19153,  20091,  20015,  20173,  20209,  21111,  21067,  22051,  22075,  22087,
           24031,   24033,  24005,  24510,  24003,  24027,  25025,  25027,  26163,  26125,  26081,
           27053,   27123,  29095,  29037,  29189,  31055,  32003,  32031,  34017,  34023,  34013,
           34003,   34017,  34031,  34021,  34025,  34027,  34035,  34007,  34029,  34001,  35001,
           35013,   36081,  36047,  36005,  36061,  36119,  36103,  36059,  36085,  36087,  36071,
           37119,   37183,  37081,  37063,  37067,  37135,  39049,  39061,  40109,  40143,  41147,
           41051,   41047,  42101,  42091,  42029,  44007,  45045,  45013,  45053,  45051,  47037,
           47157,   47093,  47001,  47173,  48201,  48113,  48215,  48439,  48453,  48029,  48141,
           48061,   48479,  48339,  48085,  48157,  48121,  48491,  48039,  48453,  48493,  49035,
           49049,   51059,  51153,  51107,  51013,  51510,  51087,  53033,  53077,  53061,  53053,  55079,  55025)


UAprofiles <- for (i in GEOID) {
  cat("----------")
  pg <- read_html(sprintf(url_base, i))
  table_i <- html_node(html_table(pg, "table#unauthorized-table.datasheet"))
}

str(UAprofiles)

# I get this error message: "Error in if (header) { : argument is not interpretable as logical"
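
One likely cause of this error: `html_table()` is called on the whole page with the CSS selector as its second argument, which rvest treats as `header`, and only then is the result passed to `html_node()`. Selecting the node first and converting it second avoids that. Separately, a `for` loop evaluates to `NULL`, so `UAprofiles` would be empty even without the error; collecting results in a list with `lapply()` is one way around that. The sketch below is a minimal illustration under assumptions, not a tested scraper for the site: `extract_profile()` and `scrape_profiles()` are hypothetical helper names, the selector is copied from the code above and may not match the live page, the inline demo table contains made-up placeholder values, and the CSV file name is arbitrary.

```r
library(rvest)
library(dplyr)

url_base <- "https://www.migrationpolicy.org/data/unauthorized-immigrant-population/county/%d"

# Hypothetical helper: pull the profile table out of one parsed page and
# tag every row with the county's GEOID. The default selector is an
# assumption taken from the question and may need adjusting.
extract_profile <- function(pg, geoid,
                            selector = "table#unauthorized-table.datasheet") {
  node <- html_node(pg, selector)                  # 1. select the <table> node
  if (inherits(node, "xml_missing")) return(NULL)  # nothing matched on this page
  tbl <- html_table(node)                          # 2. convert it to a data frame
  tbl[] <- lapply(tbl, as.character)               # same column types on every page
  tbl$GEOID <- rep(geoid, nrow(tbl))
  tbl
}

# Hypothetical helper: fetch each county page and stack the results.
scrape_profiles <- function(geoids) {
  pages <- lapply(unique(geoids), function(g) {
    Sys.sleep(1)                                   # pause between requests
    pg <- tryCatch(read_html(sprintf(url_base, g)),
                   error = function(e) NULL)       # skip pages that fail to load
    if (is.null(pg)) return(NULL)
    extract_profile(pg, g)
  })
  bind_rows(pages)                                 # NULL entries are dropped
}

# Offline check with a made-up table (placeholder values, not real data).
demo_html <- '<table id="unauthorized-table" class="datasheet">
  <tr><th>Description</th><th>Estimate</th><th>Percent</th></tr>
  <tr><td>Example row A</td><td>100</td><td>10%</td></tr>
  <tr><td>Example row B</td><td>200</td><td>20%</td></tr>
</table>'
demo <- extract_profile(read_html(demo_html), 4013)

# Real run (needs network access and the GEOID vector defined above):
# UAprofiles <- scrape_profiles(GEOID)
# write.csv(UAprofiles, "UAprofiles.csv", row.names = FALSE)
```

`unique()` is there because the GEOID vector lists a couple of counties twice. With older rvest releases, `html_table()` may need `fill = TRUE` if rows have differing numbers of cells.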

In case it helps, here is a list of the selectors for each heading and subheading:

#unauthorized-table.datasheet-heading
  #topcountries.datasheet-subheading
  #regionsbirth.datasheet-subheading
  #yearsresidence.datasheet-subheading
  #age.datasheet-subheading
  #gender.datasheet-subheading
#family.datasheet-heading
  #parental.datasheet-subheading
  #marital.datasheet-subheading
#education.datasheet-heading
  #schoolenrollment.datasheet-subheading
  #adulteducation.datasheet-subheading
  #english.datasheet-subheading
  #languages.datasheet-subheading
#workforce.datasheet-heading
  #laborforce.datasheet-subheading
  #industries.datasheet-subheading
#economics.datasheet-heading
  #income.datasheet-subheading
  #healthinsurance.datasheet-subheading
  #homeownership.datasheet-subheading
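
For the long format, one possible approach is to walk the table rows in order and carry the most recent heading and subheading down onto each data row. The sketch below assumes, without having verified it against the live page, that headings and subheadings are `<tr>` elements carrying the classes `datasheet-heading` and `datasheet-subheading` (as the selectors above suggest) and that data rows have exactly three cells. `rows_to_long()` is a hypothetical helper name, and the inline table is a made-up fixture with placeholder labels and values.

```r
library(rvest)

# Hypothetical helper: turn one table node into a long data frame with
# heading / subheading / description / estimate / percent columns.
rows_to_long <- function(table_node) {
  rows <- html_nodes(table_node, "tr")
  heading <- NA_character_
  subheading <- NA_character_
  out <- list()
  for (i in seq_along(rows)) {
    r <- rows[[i]]
    cls <- html_attr(r, "class")
    if (is.na(cls)) cls <- ""
    cells <- html_text(html_nodes(r, "th, td"), trim = TRUE)
    if (grepl("datasheet-subheading", cls, fixed = TRUE)) {
      subheading <- cells[1]                 # new subsection starts here
    } else if (grepl("datasheet-heading", cls, fixed = TRUE)) {
      heading <- cells[1]                    # new section: reset the subsection
      subheading <- NA_character_
    } else if (length(cells) == 3) {
      out[[length(out) + 1]] <- data.frame(
        heading = heading, subheading = subheading,
        description = cells[1], estimate = cells[2], percent = cells[3],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, out)                        # NULL if no data rows were found
}

# Made-up fixture mirroring the assumed structure (not real data).
demo <- read_html('<table>
  <tr id="unauthorized-table" class="datasheet-heading"><td colspan="3">Demographics</td></tr>
  <tr id="topcountries" class="datasheet-subheading"><td colspan="3">Top Countries of Birth</td></tr>
  <tr><td>Example row A</td><td>100</td><td>10%</td></tr>
  <tr id="age" class="datasheet-subheading"><td colspan="3">Age</td></tr>
  <tr><td>Example row B</td><td>200</td><td>20%</td></tr>
</table>')
long <- rows_to_long(html_node(demo, "table"))
```

Cell values are kept as character strings here; stripping commas and percent signs before converting to numbers can be done later, once the real page's formatting is known.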

0 answers:

No answers yet