Scraping tables from a non-HTML website with R, when the examples shown are for HTML

Time: 2019-03-07 14:03:00

Tags: r web-scraping

I have a question. I am trying to scrape the two tables from a website that is not a plain HTML page. This is the website:

https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp

I have been following examples that probably do not apply to this kind of site, and I have not found an answer. This is what I tried:

library(tidyverse)
library(rvest)
library(XML)
library(httr)



url <- "https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp"

poptable <- readHTMLTable(url, which = 1)

and I got this error:

  

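Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'readHTMLTable' for signature '"NULL"'
In addition: Warning message:
XML content does not seem to be XML: 'https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp'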

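I thought I could still use the readHTMLTable function even though this is an ASP page. Is there an alternative? I have been struggling for hours and have not found anything.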

1 Answer:

Answer 0: (score: 3)

Actually, this is quite simple (based on @lukeA's answer):

library(rvest)

url <- "https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp"

page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
                                       Code  Main Specialty Title
1 Surgical Specialties Surgical Specialties  Surgical Specialties
2                                       100       GENERAL SURGERY
3                                       101               UROLOGY
4                                       110 TRAUMA & ORTHOPAEDICS
5                                       120                   ENT
6                                       130         OPHTHALMOLOGY
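
The page contains two tables, and the output above shows that a section heading ("Surgical Specialties") comes through as an ordinary row. A minimal sketch of extracting every table at once and dropping such heading rows follows. It uses a small inline HTML string as a stand-in for the real page so that it runs offline. The markup, the helper name drop_heading_rows, and the rule that real entries have an all-digit Code are assumptions for illustration, not something confirmed about the NHS page.

```r
library(rvest)

# Made-up stand-in for the real page so the sketch runs without a network.
html <- '
<table>
  <tr><th>Code</th><th>Main Specialty Title</th></tr>
  <tr><td>Surgical Specialties</td><td>Surgical Specialties</td></tr>
  <tr><td>100</td><td>GENERAL SURGERY</td></tr>
  <tr><td>101</td><td>UROLOGY</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Treatment Function Title</th></tr>
  <tr><td>100</td><td>GENERAL SURGERY</td></tr>
</table>'

page <- read_html(html)

# html_table() applied to a node set returns one data frame per table.
tables <- html_table(html_nodes(page, "table"))

# Hypothetical helper. Assumption: real entries have an all-digit value in
# the first column (Code), while section-heading rows do not.
drop_heading_rows <- function(df) {
  df[grepl("^[0-9]+$", df[[1]]), , drop = FALSE]
}

clean <- lapply(tables, drop_heading_rows)
```

With the real site, replacing html with the url from above should work the same way, provided the page structure matches these assumptions.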

Selectorgadget can be installed from here: Selectorgadget by Hadley Wickham
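
As for the error in the question, a likely cause (an assumption, not confirmed for this case) is that readHTMLTable was asked to download an https URL itself, failed, and then treated the URL string as document content. If you would rather keep readHTMLTable, one option is to download the page with httr and pass the parsed document on. The helper name read_tables_via_httr below is hypothetical.

```r
library(XML)
library(httr)

# Hypothetical helper: download with httr, then parse with the XML package.
read_tables_via_httr <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # fail early on HTTP errors
  # Assumption: the page can be decoded as UTF-8.
  txt <- content(resp, as = "text", encoding = "UTF-8")
  doc <- htmlParse(txt, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)  # list with one data frame per table
}

# Offline check with made-up markup: readHTMLTable accepts a parsed document.
demo_doc <- htmlParse(
  "<table><tr><th>Code</th><th>Title</th></tr><tr><td>100</td><td>GENERAL SURGERY</td></tr></table>",
  asText = TRUE
)
demo_tables <- readHTMLTable(demo_doc, stringsAsFactors = FALSE)
```

Calling read_tables_via_httr(url) with the url from the question should then return the page's tables as a list, assuming the site serves regular HTML to httr.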