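Dear Stack Overflow users, I am trying to scrape two nodes from different pages of a website (Psychology Today; the pages are the profiles of mental health professionals, MHPs).
First, I wrote a scraping function, then a loop that calls it. In the end I am able to build a data frame. However, I would like to include, as a third variable, the full link to each page I scraped. How can I include this information in the data frame?
Here is the loop: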
library(rvest)  # provides read_html()
library(plyr)   # provides rbind.fill()

j <- 1                      # running index into df_list
MHP_codes <- 150130:150170  # therapist identifier range
# list that collects the scraped information for each MHP
df_list <- vector(mode = "list", length = length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML of the page; return NULL if the page is missing or fails to load
  page <- tryCatch(read_html(URL), error = function(e) NULL)
  if (is.null(page)) next  # skip this code instead of stopping the loop
  profile <- getProfile(page)  # the function puts the scraped data into a row
  if (all(is.na(profile))) next  # skip rows that contain only NAs
  df_list[[j]] <- profile
  j <- j + 1
}
final_df <- rbind.fill(df_list)  # combine the per-MHP rows into one data frame
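Should I modify the scraping function, getProfile(), or can I create the new variable inside the loop but outside the scraping function? And how would I do that? I have been thinking about this for days, and even now that I have the complete data set, I am still stuck on this problem.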
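To make the second option concrete, here is a minimal sketch of adding the link inside the loop but outside the scraping function. It assumes getProfile() returns a one-row data frame, which may not match the real function. The helpers get_profile_stub() and read_page_stub() are hypothetical stand-ins for getProfile() and read_html() so that the example runs without network access, and base R's rbind() is used here in place of rbind.fill().

```r
# Hypothetical stand-in for getProfile(): returns a one-row data frame.
# The real function would extract the two nodes from a parsed page.
get_profile_stub <- function(page) {
  data.frame(name = page$name, title = page$title, stringsAsFactors = FALSE)
}

# Hypothetical stand-in for read_html(): raises an error for even codes
# to imitate identifiers that have no page.
read_page_stub <- function(url) {
  code <- as.integer(basename(url))
  if (code %% 2 == 0) stop("page not found")
  list(name = paste("MHP", code), title = "Therapist")
}

base_url <- "https://www.psychologytoday.com/us/therapists/illinois/"
mhp_codes <- 150130:150135
rows <- vector("list", length(mhp_codes))

for (i in seq_along(mhp_codes)) {
  url <- paste0(base_url, mhp_codes[i])  # keep the link as a plain string
  page <- tryCatch(read_page_stub(url), error = function(e) NULL)
  if (is.null(page)) next                # failed pages leave a NULL entry
  row <- get_profile_stub(page)
  row$url <- url                         # third variable: the full link
  rows[[i]] <- row
}

# Drop the NULL entries, then stack the remaining rows.
rows <- Filter(Negate(is.null), rows)
result <- do.call(rbind, rows)
```

The key point of the sketch is that the link string and the parsed page live in separate variables, so the string is still available when the row is stored.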