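I am trying to scrape data from the following URL: https://university.careers360.com/colleges/list-of-degree-colleges-in-India I want to follow the link on each college's name and fetch specific data for each college.
The first thing I did was collect all the college URLs into a vector: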
# Load the packages
library(xml2)
library(rvest)
library(stringr)
library(dplyr)
# URL of the listing page to be scraped
baseurl <- "https://university.careers360.com/colleges/list-of-degree-colleges-in-India"
# Read the HTML content of the listing page
basewebpage <- read_html(baseurl)
# Extract each college's name and its URL
scraplinks <- function(url) {
  # Create an HTML document from the URL
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes(".title a") %>%
    rvest::html_text()
  # tibble() (re-exported by dplyr) replaces the deprecated data_frame()
  return(tibble(link = link_, url = url_))
}
# College names and URLs
allcollegeurls <- scraplinks(baseurl)
This works fine so far, but when I call read_html on each URL, it throws an error.
# Read each URL
for (i in allcollegeurls$url) {
clgwebpage <- read_html(allcollegeurls$url[i])
}
Error: 'NA' does not exist in current working directory ('C:/Users/User/Documents').
I even tried adding a 'break', but I still get the same error:
# Read each URL
for (i in allcollegeurls$url) {
clgwebpage <- read_html(allcollegeurls$url[i])
if(is.na(allcollegeurls$url[i]))break
}
Please help.
As requested, here is the str of allcollegeurls:
> str(allcollegeurls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30 obs. of 2 variables:
$ link: chr "Netaji Subhas Institute of Technology, Delhi" "Hansraj
College, Delhi" "School of Business, University of Petroleum and Energy
Studies, D.." "Hindu College, Delhi" ...
$ url : chr "https://www.careers360.com/university/netaji-subhas-
university-of-technology-new-delhi"
"https://www.careers360.com/colleges/hansraj-college-delhi"
"https://www.careers360.com/colleges/school-of-business-university-of-
petroleum-and-energy-studies-dehradun"
"https://www.careers360.com/colleges/hindu-college-delhi" ...
Answer 0 (score: 2)
This works:
purrr::map(allcollegeurls$url, read_html)
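The map function: map transforms its input by applying a function to each element and returning a list of the same length as the input. I prefer to avoid for loops in R.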
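For context on why the loop in the question fails: `for (i in allcollegeurls$url)` assigns each URL string to `i`, not a position, so `allcollegeurls$url[i]` looks up an element by name on an unnamed vector and returns `NA`. The `break` check cannot help either, because it runs only after `read_html` has already been called. Below is a minimal offline sketch of this, assuming a made-up two-element `urls` vector in place of the real column; the `read_html` calls are left as comments so the snippet runs without network access.

```r
# Hypothetical stand-in for allcollegeurls$url (invented example URLs).
urls <- c("https://example.com/a", "https://example.com/b")

# In the question's loop, i is the URL string itself, not an index.
# Indexing an unnamed character vector by a string yields NA.
first_i <- urls[1]
bad_lookup <- urls[first_i]
print(bad_lookup)

# Fix 1: use the loop variable directly, since it already is the URL.
visited <- character(0)
for (u in urls) {
  visited <- c(visited, u)   # in real code: read_html(u)
}

# Fix 2: loop over positions and index by position.
by_index <- character(length(urls))
for (i in seq_along(urls)) {
  by_index[i] <- urls[i]     # in real code: read_html(urls[i])
}
```

Either fix passes the actual URL to `read_html`; `purrr::map` does the same thing without an explicit loop.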
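Answer 1 (score: 0)
I ran into almost the same problem with my data today.
Please remove all NA values from the URLs.
In my case the error was:
Error: '' does not exist in current working directory.
I removed the blank entries from the column I was applying the function to, and it worked.
The error above indicates that the function cannot be applied to NA.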
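A minimal sketch of that cleanup, assuming the URLs are held in a character vector (the `raw_urls` values below are invented for illustration). It drops `NA` and blank entries before anything is passed to `read_html`:

```r
# Hypothetical URL column containing a missing value and a blank entry.
raw_urls <- c("https://example.com/a", NA, "", "https://example.com/b")

# Keep only entries that are neither NA nor empty after trimming whitespace.
clean_urls <- raw_urls[!is.na(raw_urls) & nzchar(trimws(raw_urls))]
print(clean_urls)

# In real code, the cleaned vector would then be read page by page, e.g.:
# pages <- lapply(clean_urls, xml2::read_html)
```

Filtering first keeps the reading step simple, since every remaining element is a non-empty string.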