Question

I'm practicing web scraping in R, and no matter which website I try, I get stuck at the same step.

For example,

https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music

My goal is to extract the names of all 77 schools (University of Oxford through London Metropolitan).

So I tried...

library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()

Using F12 (the browser's developer tools), I can see that all the school names are in the class '.league-table-institution-name'... which is why I passed that to html_nodes...

What am I doing wrong?

Answer 1

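You appear to be running html_nodes() twice: first on college, the xml_document (which is correct), and then on info, which is the node set the first call already returned (which is not correct). The second call searches inside the nodes you have already selected, so it finds nothing.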
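
A minimal sketch of that behaviour, using a made-up inline HTML snippet instead of the live page (the markup here is an assumption; the real page's structure may differ). It shows that the first html_nodes() call selects the elements, while repeating the same selector on the result should come back empty, because none of the selected elements contains another element with that class:

```r
library(rvest)

# Hypothetical stand-in for the real page: two elements carrying the class.
page <- read_html('
  <div>
    <a class="league-table-institution-name">University of Oxford</a>
    <a class="league-table-institution-name">University of Cambridge</a>
  </div>')

# First call: select the matching elements from the document.
info <- html_nodes(page, css = ".league-table-institution-name")
n_first <- length(info)

# Second call with the same selector: this searches the descendants of the
# nodes in `info`, not the nodes themselves.
nested <- html_nodes(info, ".league-table-institution-name")
n_second <- length(nested)

# Extracting text directly from the first result is what was intended.
school_names <- html_text(info)
print(school_names)
```

Comparing n_first and n_second makes it easy to see at which step the selection is lost.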

Try this instead:

url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text()

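Then you need an extra step to clean up the school names. I suggest appending this to the pipeline (str_replace_all() comes from the stringr package):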

%>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
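
Put together, the whole pipeline might look like the sketch below. To keep it runnable offline, it parses a made-up inline HTML snippet rather than the live URL; the markup and the padding around the names (rank numbers and spaces) are assumptions chosen to show what the regular expression strips, not a copy of the real page.

```r
library(rvest)
library(stringr)

# Hypothetical markup: names padded with rank numbers and whitespace.
page <- read_html('
  <div>
    <a class="league-table-institution-name"> 1 University of Oxford </a>
    <a class="league-table-institution-name"> 77 London Metropolitan University </a>
  </div>')

# Select the elements once, then extract their text.
raw_names <- page %>%
  html_nodes(".league-table-institution-name") %>%
  html_text()

# Strip any run of non-letters at the start or at the end of each string.
clean_names <- raw_names %>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")

print(clean_names)
```

For the real page, replace the inline snippet with read_html(url_college). Note that the pattern treats only ASCII letters as part of a name, so a name that legitimately ends in a digit or a closing parenthesis would lose that character.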
