Question

I have a problem. I am trying to scrape two tables from a non-HTML website. This is the site:

https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp

I may be following an approach I should not be using, but I have not found any answers. This is what I have tried:

library(tidyverse)
library(rvest)
library(XML)
library(httr)



url <- "https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp"

poptable <- readHTMLTable(url, which = 1)

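and I get this error:

Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp'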

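I assumed I could still use the readHTMLTable function even though the page is an ASP page. Is there an alternative? I have not found anything and have been struggling with this for hours.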

Answer 1

Actually, this is quite simple (based on @lukeA's answer):

library(rvest)

url <- "https://www.datadictionary.nhs.uk/web_site_content/supporting_information/main_specialty_and_treatment_function_codes_table.asp"

page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
                                       Code  Main Specialty Title
1 Surgical Specialties Surgical Specialties  Surgical Specialties
2                                       100       GENERAL SURGERY
3                                       101               UROLOGY
4                                       110 TRAUMA & ORTHOPAEDICS
5                                       120                   ENT
6                                       130         OPHTHALMOLOGY
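
Since the question asks for both tables, the same approach can be applied to every node rather than only nodes[[1]]. The following is a minimal sketch, assuming a small inline HTML string (the hypothetical `sample_html`, with made-up rows and a made-up second column name) as a stand-in for the downloaded page so that it runs without network access:

```r
library(rvest)

# Hypothetical stand-in for the real page: two small tables.
sample_html <- '<html><body>
<table>
  <tr><th>Code</th><th>Main Specialty Title</th></tr>
  <tr><td>100</td><td>GENERAL SURGERY</td></tr>
  <tr><td>101</td><td>UROLOGY</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Treatment Function Title</th></tr>
  <tr><td>100</td><td>GENERAL SURGERY</td></tr>
</table>
</body></html>'

page <- read_html(sample_html)       # read_html() also accepts literal HTML
nodes <- html_nodes(page, "table")   # one node per <table> element
tables <- lapply(nodes, html_table)  # convert every table, not just the first

length(tables)   # number of tables found
tables[[2]]      # the second table
```

With the real page, replacing `sample_html` with `url` in the `read_html()` call should give one list element per table on the page.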

Selectorgadget can be installed here: Selectorgadget by Hadley Wickham
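
As for why the original attempt failed: the likely cause is that the XML package cannot fetch https URLs itself, so readHTMLTable never receives a parsed document. If you prefer to stay with readHTMLTable, one workaround is to download the page text separately and parse it from a string. The sketch below assumes a hypothetical helper `parse_first_table` and a made-up `sample_html` string for illustration; the httr calls are shown as comments because they need network access:

```r
library(XML)

# Hypothetical helper: parse HTML that is already in memory as text.
parse_first_table <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)  # treat the input as content, not a path
  readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
}

# Made-up stand-in for the downloaded page text.
sample_html <- "<html><body><table>
<tr><th>Code</th><th>Main Specialty Title</th></tr>
<tr><td>100</td><td>GENERAL SURGERY</td></tr>
</table></body></html>"

first <- parse_first_table(sample_html)

# With the real site, the text would come from httr, for example:
# resp <- httr::GET(url)
# poptable <- parse_first_table(httr::content(resp, as = "text", encoding = "UTF-8"))
```

The rvest version above is shorter, so this is mainly useful if other code already depends on the XML package.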

Scraping tables from a non-HTML website with R, when the examples shown only apply to HTML
