Question

I'm trying to get the HTML table from this page, but every approach I've tried has failed (it looks like the document is malformed).

I tried it this way:

library(XML)
x = readHTMLTable("https://www.jpmorganchasecc.com/results/search.php?city_id=16&search=1&gender=m&year=2015")

I got the error

XML content does not seem to be XML

Then I tried this:

library(RCurl)
fileURL <- "(same link as before)"
xData <- getURL(fileURL)
doc <- xmlParse(xData)

I got

unable to parse xmlns

So I'm wondering whether I should try to find a way (maybe a regular expression?) to extract just the table markup and then parse that.

Answer 1

Try this:

library(XML)
library(RCurl)

url <- "https://www.jpmorganchasecc.com/results/search.php?city_id=16&search=1&gender=m&year=2015"

tables <- getURL(url)
tables <- readHTMLTable(tables, stringsAsFactors = FALSE)

#Shows you all the tables pulled
str(tables)

#To view a particular table
View(tables$results)
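
A likely reason this works where the direct call did not: the XML package's own URL fetching generally does not handle https, so downloading the page with getURL first and passing the text on sidesteps that. Likewise, xmlParse expects well-formed XML, while the HTML parser tolerates sloppy markup. If you only want the one table, a minimal sketch is to parse the text with htmlParse and pick the table by its id. The inline HTML below is a made-up stand-in for the real page source (the column names and rows are invented for illustration); with the real page you would pass the string returned by getURL(url) instead.

```r
library(XML)

# Hypothetical stand-in for the downloaded page text (assumption: the real
# page contains a table with id="results", as tables$results above suggests).
html <- '<html><body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="results">
<tr><th>Place</th><th>Name</th></tr>
<tr><td>1</td><td>A. Runner</td></tr>
<tr><td>2</td><td>B. Runner</td></tr>
</table>
</body></html>'

# htmlParse uses the forgiving HTML parser; asText = TRUE says the input is
# markup, not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Select only the table whose id attribute is "results".
node <- getNodeSet(doc, "//table[@id='results']")[[1]]

# Convert that single table node to a data frame.
results <- readHTMLTable(node, stringsAsFactors = FALSE)
str(results)
```

This avoids regular expressions entirely: the parser does the work of finding the table, and XPath selects it.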

Answer 2

If you use rvest, you only need to target the right table:

library(rvest)

URL <- "https://www.jpmorganchasecc.com/results/search.php?city_id=16&search=1&gender=m&year=2015"
pg <- read_html(URL)
dat <- html_table(html_nodes(pg, "table#results"))[[1]]
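
To see what the selector is doing without hitting the network, here is a small sketch on an inline HTML string. The markup is invented for illustration; only the id "results" is taken from the code above. It assumes rvest 1.0.0 or later, where html_element (singular) returns the first match, so the trailing [[1]] is not needed.

```r
library(rvest)

# Hypothetical page source with two tables; only one has id="results".
pg <- read_html('<html><body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="results">
<tr><th>Place</th><th>Name</th></tr>
<tr><td>1</td><td>A. Runner</td></tr>
<tr><td>2</td><td>B. Runner</td></tr>
</table>
</body></html>')

# "table#results" is a CSS selector: a <table> element whose id is "results".
dat <- html_table(html_element(pg, "table#results"))
print(dat)
```

The CSS selector plays the same role as a regular expression would, but it works on the parsed document tree, so malformed markup around the table does not matter.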

Guide to scraping HTML tables
