Function for "next page" rvest scrape

Asked: 2017-04-05 13:23:52

Tags: r scrape rvest

I've added the final code I ended up using below, in case anyone has a similar problem. I used the answer provided below, but added a couple of nodes, system sleep times (to prevent being kicked off the server), and an if argument that prevents an error after the last valid page has been scraped.

I'm trying to pull multiple pages from a website by following its "next page" links. I created a data frame with a nextpage variable and filled in the first value with the starting URL.
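(The setup code isn't shown in the question; a minimal sketch of what that data frame might look like, with the nextpage name taken from the description above and everything else assumed:)

```{r}
# Hypothetical sketch only: a one-row data frame whose 'nextpage' column
# holds the starting URL; the data frame name 'posts' is an assumption.
posts <- data.frame(
    nextpage = "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/",
    stringsAsFactors = FALSE
)
```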

```{r}
library(rvest)

## Create html object
blogfunc <- read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")

## Create object with the post content scraped
posttext <- blogfunc %>% 
    html_nodes(".article-content") %>% 
    html_text()
## Strip \a, \t and \n characters from the scraped text
posttext <- gsub('[\a\t\n]', '', posttext)

## Scrape next url
nexturl <- blogfunc %>% 
    html_nodes(".prev-post-link-wrap a") %>% 
    html_attr("href")
```

The above pulls the text I want (I know the code is clunky - I'm brand new at this - but it does get me what I want). Any suggestions on turning it into a function and using it to fill in the data frame? I'm struggling to apply the examples I've found online.

Edit - my final code, with sleep times and an if argument that stops after the last valid page:

```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2008/05/31/the-making-of-a-wet-willy/"
# Select the first page.

getPostContent <- function(url){
    # Pause so the server doesn't take us for a robot.
    Sys.sleep(2)
    # Pull the node for the post content and strip \a, \t and \n characters.
    read_html(url) %>% 
        html_nodes(".article-content") %>% 
        html_text() %>% 
        gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}

getDate <- function(url) {
    Sys.sleep(2.6)
    # Pull the node for the date.
    read_html(url) %>% 
        html_node(".updated") %>%
        html_text()
}

getTitle <- function(url) {
    Sys.sleep(.8)
    # Pull the node for the title.
    read_html(url) %>% 
        html_node(".article-title") %>%
        html_text()
}

getNextUrl <- function(url) {
    Sys.sleep(.2)
    # Pull the node for the url of the previous post.
    read_html(url) %>% 
        html_node(".prev-post-link-wrap a") %>%
        html_attr("href")
}

scrapeBackMap <- function(url, n){
    Sys.sleep(3)
    purrr::map_df(1:n, ~{
        # Only run while the url is not NA, i.e. while a previous post exists.
        if (!is.na(url)) {
            oUrl  <- url
            date  <- getDate(url)
            post  <- getPostContent(url)
            title <- getTitle(url)
            url  <<- getNextUrl(url)

            # One row per scraped post.
            data.frame(curpage  = oUrl,
                       nexturl  = url,
                       posttext = post,
                       pubdate  = date,
                       ptitle   = title)
        }
    })
}

res <- scrapeBackMap(url, 3000)
class(res)
str(res)
# Builds the data frame.
```


1 Answer:

Answer 0 (score: 3)

The idea I followed is to scrape the content of each post, find the URL of the previous post, navigate to that URL, and repeat the process.

```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
```

Scrape the post's content

```{r}
getPostContent <- function(url){
    read_html(url) %>% 
        html_nodes(".article-content") %>% 
        html_text() %>% 
        gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}
```

Scrape the next URL

```{r}
getNextUrl <- function(url) {
    read_html(url) %>% 
        html_node(".prev-post-link-wrap a") %>%
        html_attr("href")
}
```

Once we have these 'supporting' functions, we can glue them together.

Apply the function n times

I guess a for or while loop could be set up to continue until getNextUrl returns nothing (in practice NA, since html_attr() returns NA when the node is missing), but I preferred to define a number n of 'jumps back' and apply the function at each jump; a sketch of the while variant follows.
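(A minimal sketch of that while variant, under the NA stopping assumption just mentioned and reusing the helpers defined above; the function name scrapeBackWhile is made up for illustration:)

```{r}
# Sketch only: walk back through posts until getNextUrl() returns NA
# (no previous post), collecting each post's content along the way.
scrapeBackWhile <- function(url) {
    posts <- character(0)
    while (!is.na(url)) {
        posts <- c(posts, getPostContent(url))
        url <- getNextUrl(url)
    }
    posts
}
```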

Starting from the original url, we retrieve its content, then overwrite url with the newly extracted value and carry on until the loop breaks.

```{r}
scrapeBackApply <- function(url, n) {
    sapply(1:n, function(x) {
        r <- getPostContent(url)
        # Overwrite global 'url'
        url <<- getNextUrl(url)
        r
    })
}
```

Alternatively, we can use the purrr::map family, and map_df in particular, to obtain a data.frame directly, like your bframe.

```{r}
scrapeBackMap <- function(url, n) {
    purrr::map_df(1:n, ~{
        oUrl <- url
        post <- getPostContent(url)
        # Overwrite 'url' with the previous post's address
        url <<- getNextUrl(url)
        data.frame(curpage = oUrl,
                   nexturl = url,
                   posttext = post)
    })
}
```

Results

```{r}
res <- scrapeBackApply(url, 2)
class(res)
#> [1] "character"
str(res)
#>  chr [1:2] "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ ...

res <- scrapeBackMap(url, 4)
class(res)
#> [1] "data.frame"
str(res)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ curpage : chr  "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/" "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/"
#>  $ nexturl : chr  "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/" "http://www.ashleyannphotography.com/blog/2017/03/24/the-youngest-cousin/"
#>  $ posttext: chr  "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__
```
#>  $ posttext: chr  "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__