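So here's my situation: I've had a lot of success scraping through a series of URLs, which I usually build by grabbing the href(s) and appending them to the domain. This is the strategy I'm using here: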
library(rvest)

data <- list()
for (i in seq_along(classes)) {
  # Download and parse one course page
  course <- read_html(classes[i])
  title <- course %>%
    html_node('h1') %>%
    html_text()
  description <- course %>%
    html_node('.block_content') %>%
    html_text()
  # Append this page's fields to the running list
  data[[length(data) + 1]] <- list(Title = title, Description = description)
}
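classes is a vector of strings that look like this (the beginnings and ends are cut off because they are links and I don't have the rep to post them):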
[1] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
[2] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
[3] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
[4] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
[5] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid="
...
[2340] "ttp://catalog.pomona.edu/preview_course_nopop.php?catoid"
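There's no problem when I test the links individually, and the loop also runs fine if I request a specific URL rather than the whole index. But if I run it over the full length of classes, it runs for a long time and returns only one result: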
> description
[1] "\n \n \t\t\t\t\t\tHELP\n\t\t\t\t\t\t2017-2018 Pomona College Catalog Print-Friendly Page [Add to Portfolio] \n THEA199IRPO - Theatre: Independent ResearchWhen Offered: Each semester.Instructor(s): StaffCredit: 0.5-1A substantial and significant piece of original research or creative product produced. Prerequisite course work required. Available for full or half-course credit. Back to Top | Print-Friendly Page [Add to Portfolio] "
> title
[1] "THEA199IRPO - Theatre: Independent Research"
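I'm honestly stumped, considering that a) I've had success with this before and b) the links aren't broken. I'm not getting any error messages either. Any help is very welcome!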
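One thing that may be worth checking: in the loop above, title and description are overwritten on every pass, so printing them afterwards shows only the last page, while the accumulated results live in data. Below is a minimal sketch of inspecting data instead. The helper parse_course and the inline HTML strings are hypothetical stand-ins so the example runs offline; they are not taken from the catalog site.

```r
library(rvest)

# Hypothetical helper: extract the two fields from one parsed page.
parse_course <- function(page) {
  list(
    Title = page %>% html_node("h1") %>% html_text(),
    Description = page %>% html_node(".block_content") %>% html_text()
  )
}

# Inline HTML stands in for the real pages so the sketch runs offline.
pages <- c(
  "<html><body><h1>Course A</h1><div class='block_content'>First</div></body></html>",
  "<html><body><h1>Course B</h1><div class='block_content'>Second</div></body></html>"
)

data <- lapply(pages, function(p) parse_course(read_html(p)))

length(data)        # number of pages collected
data[[1]]$Title     # fields of an individual entry

# Flatten the list of lists into one data frame, one row per page.
courses <- do.call(rbind, lapply(data, as.data.frame, stringsAsFactors = FALSE))
courses$Title
```

With the real URLs, the same idea would presumably be lapply(classes, function(u) parse_course(read_html(u))). If length(data) matches length(classes), the loop collected every page and only the inspection step was misleading.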