Web scraping in R with titles and summaries

Time: 2017-06-15 14:11:22

Tags: r web rvest

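I am trying to extract the links from here, together with the article title and a short summary for each link. The output should contain the title and the short summary of each article; both appear on the same page as the links.

I am able to get the links. Could you suggest how I can get the title and summary for each link? Please see my code below.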

# install.packages('rvest')  # run once if rvest is not yet installed

# Load the rvest package (xml2 provides read_html())
library('rvest')
library(xml2)

# URL of the website to be scraped
url <- 'http://money.howstuffworks.com/business-profiles.htm'

# Download and parse the page once
pg <- read_html(url)

# Show the href attribute of the first few links on the page
head(html_attr(html_nodes(pg, "a"), "href"))

1 answer:

Answer 0: (score: 2)

We can use purrr to go through each article node and extract the relevant pieces of information:

library(rvest)
library(purrr)

url <- 'http://money.howstuffworks.com/business-profiles.htm'
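articles <- read_html(url) %>% 
    # one node per article teaser
    html_nodes('.infinite-item > .media') %>% 
    # build a one-row data frame per node and row-bind the results
    map_df(~{
        # article title
        title <- .x %>% 
            html_node('.media-heading > h3') %>% 
            html_text()

        # short summary (the first paragraph of the teaser)
        head <- .x %>% 
            html_node('p') %>% 
            html_text()

        # URL of the full article
        link <- .x %>% 
            html_node('p > a') %>% 
            html_attr('href')

        data.frame(title, head, link, stringsAsFactors = FALSE)
    })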

head(articles)
#>                                                             title
#> 1                              How Amazon Same-day Delivery Works
#> 2              10 Companies That Completely Reinvented Themselves
#> 3                                10 Trade Secrets We Wish We Knew
#> 4                                           How Kickstarter Works
#> 5                          Can you get rich selling stuff online?
#> 6 Are the Golden Arches really supposed to be giant french fries?
#>                                                                                                                                                           head
#> 1                 The Amazon same-day delivery service aims to get your package to you in no time at all. Learn how Amazon same-day delivery works. See more »
#> 2 You might be surprised at what some of today's biggest companies used to do. Here are 10 companies that reinvented themselves from HowStuffWorks. See more »
#> 3              Trade secrets are often locked away in corporate vaults, making their owners a fortune. Which trade secrets are the stuff of legend? See more »
#> 4        Kickstarter is a service that utilizes crowdsourcing to raise funds for your projects. Learn about how Kickstarter works at HowStuffWorks. See more »
#> 5                                                   Can you get rich selling your stuff online? Find out more in this article by HowStuffWorks.com. See more »
#> 6     Are McDonald's golden arches really suppose to be giant french fries? Check out this article for a brief history of McDonald's golden arches. See more »
#>                                                                    link
#> 1           http://money.howstuffworks.com/amazon-same-day-delivery.htm
#> 2 http://money.howstuffworks.com/10-companies-reinvented-themselves.htm
#> 3                   http://money.howstuffworks.com/10-trade-secrets.htm
#> 4                        http://money.howstuffworks.com/kickstarter.htm
#> 5    http://money.howstuffworks.com/can-you-get-rich-selling-online.htm
#> 6                   http://money.howstuffworks.com/mcdonalds-arches.htm
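
Because the live page may change or become unavailable, the same extraction can be tried offline. The following is only a minimal sketch: the HTML string is hypothetical markup that imitates the structure the selectors above assume, and the example.com URLs are placeholders. It also swaps in a different technique from the answer: instead of mapping over nodes with purrr, it relies on html_node() returning one match per element when it is applied to a node set, so the three columns stay aligned.

```r
library(rvest)

# Hypothetical markup imitating the structure assumed by the selectors above;
# the titles, summaries and example.com URLs are made up for illustration.
html <- '
<div class="infinite-item">
  <div class="media">
    <div class="media-heading"><h3>First article</h3></div>
    <p>Summary of the first article. <a href="http://example.com/first.htm">See more</a></p>
  </div>
</div>
<div class="infinite-item">
  <div class="media">
    <div class="media-heading"><h3>Second article</h3></div>
    <p>Summary of the second article. <a href="http://example.com/second.htm">See more</a></p>
  </div>
</div>'

# read_html() also accepts a literal HTML string
page <- read_html(html)

# One node per article teaser
items <- html_nodes(page, '.infinite-item > .media')

# html_node() on a node set yields one result per teaser,
# so the columns line up row by row
articles <- data.frame(
    title = html_text(html_node(items, '.media-heading > h3')),
    head  = html_text(html_node(items, 'p')),
    link  = html_attr(html_node(items, 'p > a'), 'href'),
    stringsAsFactors = FALSE
)

print(articles)
```

This version avoids the purrr dependency; whether it or the map_df() form reads better is largely a matter of taste.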

Obligatory comment: in this case, I see no disclaimer against harvesting in their Terms and conditions, but always check a site's terms before scraping it.