Question

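Dear Stack Overflow users, I am trying to scrape two nodes from different pages of a website (Psychology Today; the pages are the profiles of mental health professionals, MHPs).

First I create a scraping function, then a loop that calls this function. In the end I am able to build a data frame. However, I would like to include, as a third variable, the full link to each individual page I scraped. How can I include this information in the data frame?

Here is the loop: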

library(rvest) # read_html()
library(plyr)  # rbind.fill()

j <- 1 # running index of the next free slot in df_list (increases by one per scraped page)
MHP_codes <- 150130:150170 # therapist identifier range
df_list <- vector(mode = "list", length = length(MHP_codes)) # collects one row of MHP information per page
for (code1 in MHP_codes) {
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website. If the page does not exist or
  # fails to load, forcing do.next evaluates `next` and skips to the next
  # code instead of stopping the loop.
  page <- tryCatch(read_html(URL),
                   error = function(e) force(do.next))
  df_list[[j]] <- getProfile(page) # the function puts the scraped data into a row
  j <- j + 1
}
final_df <- rbind.fill(df_list) # gather the rows into one data set

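Should I modify the scraping function, getProfile(), or can I create the new variable inside the loop but outside the scraping function? And how would I do that? I have been thinking about this for days, and even now, with the complete data set in hand, I am still stuck on this problem.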
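To make the second option concrete, here is a minimal sketch of adding the link inside the loop, outside the scraping function. It runs offline: `fetch_stub()` and `get_profile_stub()` are hypothetical stand-ins for `read_html()` and `getProfile()`, and the column names `name`, `specialty`, and `link` are assumptions made only for illustration.

```r
base_url <- "https://www.psychologytoday.com/us/therapists/illinois/"
codes <- 150130:150132

# Hypothetical stand-in for read_html(): returns a fake page and
# raises an error for one code to imitate a missing profile.
fetch_stub <- function(url) {
  if (grepl("150131$", url)) stop("page not found")
  list(name = paste("MHP", basename(url)), specialty = "unknown")
}

# Hypothetical stand-in for getProfile(): turns a page into a one-row data frame.
get_profile_stub <- function(page) {
  data.frame(name = page$name, specialty = page$specialty,
             stringsAsFactors = FALSE)
}

rows <- lapply(codes, function(code) {
  url <- paste0(base_url, code)                # keep the URL string
  page <- tryCatch(fetch_stub(url), error = function(e) NULL)
  if (is.null(page)) return(NULL)              # skip pages that failed to load
  row <- get_profile_stub(page)
  row$link <- url                              # third variable: the page link
  row
})

# Drop the skipped pages, then stack the remaining rows.
final_df <- do.call(rbind, Filter(Negate(is.null), rows))
print(final_df)
```

The point of the sketch is only that the URL string is kept in its own variable, separate from the parsed page, so it can be attached to the row after the scraping function returns. Whether this fits the real getProfile() depends on it returning a one-row data frame, which is assumed here.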

This post is related to two other posts, post1 and post2, which give more information about the scraping function.

R - Scraping nodes from website pages - How to insert the individual page links into the final data frame
