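I am trying to scrape data from the following URL: https://university.careers360.com/colleges/list-of-degree-colleges-in-India I want to click on each college's name and get specific data for each college.
First, I collect all the college URLs into a vector: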
# Load the packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)
# Specify the URL of the website to be scraped
baseurl <- "https://university.careers360.com/colleges/list-of-degree-colleges-in-India"
# Read the HTML content of the listing page
basewebpage <- read_html(baseurl)
# Extract each college name and its URL
scraplinks <- function(url){
  # Create an HTML document from the URL
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_text()
  # tibble() replaces the deprecated data_frame()
  return(tibble(link = link_, url = url_))
}
# College names and URLs
allcollegeurls <- scraplinks(baseurl)
# Read each URL
library(purrr)
allreadurls <- map(allcollegeurls$url, read_html)
This works fine so far, but when I write the following code, it throws an error.
#Specialization
#Using CSS selectors to scrape the specialization section
allcollegeurls$Specialization<-NA
for (i in allreadurls) {
allcollegeurls$Specialization[i] <- html_nodes(allreadurls[i][],'td:nth-
child(1)')
}
Error in allreadurls[i] : invalid subscript type 'list'
Answer 0 (score: 0)
I am not sure about the scraped content itself, but you probably want to replace the loop with
for (i in seq_along(allreadurls)) {
  allcollegeurls$Specialization[i] <- html_nodes(allreadurls[[i]], 'td:nth-child(1)')
}
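As a variation on the loop above, here is a minimal sketch that avoids the index altogether. It assumes the goal is the text of the first cell in each table row, which the question does not confirm. The two inline HTML pages and the names pages and first_cells are made up for illustration and stand in for allreadurls. Because one page can yield several cells, the results are kept in a list rather than forced into a single value per row.

```r
library(xml2)
library(rvest)
library(purrr)

# Two made-up pages standing in for the parsed college pages (assumption).
pages <- list(
  xml2::read_html("<table><tr><td>Physics</td><td>3 yrs</td></tr><tr><td>Chemistry</td><td>3 yrs</td></tr></table>"),
  xml2::read_html("<table><tr><td>History</td><td>3 yrs</td></tr></table>")
)

# map() hands each parsed page to the function directly, so no index is needed.
# Each result is a character vector with one entry per matching cell.
first_cells <- map(pages, function(page) {
  rvest::html_text(rvest::html_nodes(page, "td:nth-child(1)"))
})
```

With the real data, something like allcollegeurls$Specialization <- first_cells would store the result as a list-column, one character vector per college.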
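One problem with your approach is that the role of i is inconsistent: it takes its values from allreadurls, but you then also use it to subset Specialization and allreadurls. Another problem is the selector string, which is split across two lines ('td:nth-' and 'child(1)'); it should read 'td:nth-child(1)'. Also, since allreadurls is a list, you want to subset it with [[i]] rather than [i] (which returns a list again). Finally, the trailing [] is not needed.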
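The indexing points can be checked without any scraping. The sketch below uses base R only; the list of strings named pages is a made-up stand-in for allreadurls, and the other variable names are hypothetical too.

```r
# Made-up stand-in for the list of parsed pages (assumption).
pages <- list("page one", "page two", "page three")

# `for (p in pages)` yields the elements themselves, not their positions.
seen <- character(0)
for (p in pages) seen <- c(seen, p)

# Single brackets keep the list wrapper; double brackets extract the element.
wrapped   <- pages[2]    # still a list, of length 1
unwrapped <- pages[[2]]  # the character string itself

# Using a list as a subscript triggers the same kind of error as in the question.
msg <- tryCatch(pages[list(1)], error = function(e) conditionMessage(e))

# Iterating over positions lets the same i index both the input and the output.
lens <- integer(length(pages))
for (i in seq_along(pages)) lens[i] <- nchar(pages[[i]])
```

In the question, i was a parsed page (itself a list-like object), which is why allreadurls[i] failed with the invalid subscript error.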