I've added the final code I ended up using below, in case anyone has a similar problem. I used the answer provided below, but added a few more nodes, system sleep times (to keep from being kicked off the server), and an if argument to avoid an error after the last valid page.
I am trying to scrape multiple pages from a website using its "next page" link. I created a data frame with a nextpage variable and filled in the first value with the starting URL.
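Roughly along these lines (the bframe name comes from the answer's reference to it; this is just a sketch, not my exact code):

```
# Hypothetical starting data frame: only the first 'nextpage' value is known up front.
bframe <- data.frame(
  nextpage = "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/",
  stringsAsFactors = FALSE
)
```

My single-page scraping code follows.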
```
library(rvest)

## create html object
blogfunc <- read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")

## create object with post content scraped
posttext <- blogfunc %>%
  html_nodes(".article-content") %>%
  html_text()
posttext <- gsub('[\a]', '', posttext)
posttext <- gsub('[\t]', '', posttext)
posttext <- gsub('[\n]', '', posttext)

## scrape next url
nexturl <- blogfunc %>%
  html_nodes(".prev-post-link-wrap a") %>%
  html_attr("href")
```
The code above pulls the text I want (I know it is clunky, I'm brand new at this, but it does get me what I want). Below is the final code I ended up using: the answer further down, plus the extra nodes, the sleep times, and the if check that stops cleanly after the last valid page.
```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2008/05/31/the-making-of-a-wet-willy/"
# Select first page.

getPostContent <- function(url){
  Sys.sleep(2)
  # Pause so the server doesn't flag us as a robot.
  read_html(url) %>%
    html_nodes(".article-content") %>%
    html_text() %>%
    gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}
# Pulls node for post content.

getDate <- function(url) {
  Sys.sleep(2.6)
  read_html(url) %>%
    html_node(".updated") %>%
    html_text()
}
# Pulls node for date.

getTitle <- function(url) {
  Sys.sleep(.8)
  read_html(url) %>%
    html_node(".article-title") %>%
    html_text()
}
# Pulls node for title.

getNextUrl <- function(url) {
  Sys.sleep(.2)
  read_html(url) %>%
    html_node(".prev-post-link-wrap a") %>%
    html_attr("href")
}
# Pulls node for url to previous post.

scrapeBackMap <- function(url, n){
  Sys.sleep(3)
  purrr::map_df(1:n, ~{
    if(!is.na(url)){
      # Only run if the URL is not NA (i.e. a previous post still exists).
      oUrl  <- url
      date  <- getDate(url)
      post  <- getPostContent(url)
      title <- getTitle(url)
      url  <<- getNextUrl(url)
      data.frame(curpage  = oUrl,
                 nexturl  = url,
                 posttext = post,
                 pubdate  = date,
                 ptitle   = title)
      # One row per post for the data frame.
    }
  })
}

res <- scrapeBackMap(url, 3000)
class(res)
str(res)
# Creates the data frame.
```
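A note on the if (!is.na(url)) guard above: once the walk reaches the oldest post there is no .prev-post-link-wrap link, so html_node() matches nothing, html_attr() returns NA, and read_html(NA) on the next jump would throw an error. A standalone illustration of that behaviour (the HTML snippet here is made up for the example):

```
library(rvest)

# A page with no 'previous post' link: html_node() matches nothing,
# so html_attr() returns NA instead of a URL.
page <- read_html("<html><body><div>no previous-post link here</div></body></html>")
page %>% html_node(".prev-post-link-wrap a") %>% html_attr("href")
#> [1] NA
```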
My original question: any suggestions on turning the single-page code above into a function and using it to fill in the data frame? I was struggling to apply the online examples.
The answer I used follows; my final code above extends it with the sleep times and the check for the last valid page.
Answer (score 3):
The idea I followed is to scrape each post's content, find the "previous post" URL, navigate to it, and repeat the process.
```
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"

getPostContent <- function(url){
  read_html(url) %>%
    html_nodes(".article-content") %>%
    html_text() %>%
    gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}

getNextUrl <- function(url) {
  read_html(url) %>%
    html_node(".prev-post-link-wrap a") %>%
    html_attr("href")
}
```
Once we have these "support" functions, we can glue them together. A for or while loop could be set up to continue until getNextUrl returns NULL, but I preferred to define a number of jumps back n and apply the functions at each jump.
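For completeness, a minimal sketch of that loop variant, built on the support functions above; it tests is.na() because html_attr() actually returns NA rather than NULL when the link is absent:

```
# Sketch: walk back until no 'previous post' link is found,
# instead of fixing the number of jumps in advance.
scrapeBackWhile <- function(url) {
  posts <- character(0)
  while (!is.na(url)) {
    posts <- c(posts, getPostContent(url))
    url <- getNextUrl(url)  # NA once the oldest post is reached
  }
  posts
}
```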
Starting from the original url, we retrieve its content, then overwrite url with the newly extracted value and continue until the loop ends.
```
scrapeBackApply <- function(url, n) {
  sapply(1:n, function(x) {
    r <- getPostContent(url)
    # Overwrite 'url' in the enclosing environment with the previous-post link.
    url <<- getNextUrl(url)
    r
  })
}
```
Alternatively, we can use the purrr::map family, map_df in particular, to obtain a data.frame directly (like your bframe).
```
scrapeBackMap <- function(url, n) {
  purrr::map_df(1:n, ~{
    oUrl <- url
    post <- getPostContent(url)
    url <<- getNextUrl(url)
    data.frame(curpage  = oUrl,
               nexturl  = url,
               posttext = post)
  })
}
```
```
res <- scrapeBackApply(url, 2)
class(res)
#> [1] "character"
str(res)
#> chr [1:2] "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ ...
```
```
res <- scrapeBackMap(url, 4)
class(res)
#> [1] "data.frame"
str(res)
#> 'data.frame': 4 obs. of 3 variables:
#>  $ curpage : chr "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/" "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/"
#>  $ nexturl : chr "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/" "http://www.ashleyannphotography.com/blog/2017/03/24/the-youngest-cousin/"
#>  $ posttext: chr "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__
```