Hello, I am trying to retrieve the meta descriptions of these web pages
from the page source:
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))
Desired output:
Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook",
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))
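I tried to do this with httr, but I cannot get the result in the desired output format, nor can I extract the description from the content retrieved with the GET command: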
library (httr)
resp<-GET ("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers :List of 22
..$ server : chr "Apache/2.2"
The field I need to extract from the page source comes right after this string:
<meta itemprop="description" content="
like this:
<meta itemprop="description" content="'Spam King'
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
Answer 0 (score: 6)
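You really only need rvest. Since the descriptions are all <h1> headlines, you can iterate over the list of URLs and pick out the headline from each page: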
library(rvest)
sapply(Data$Pages,
function(url){
url %>%
as.character() %>% # in case strings are stored as factors
read_html() %>%
html_nodes('h1') %>%
html_text()
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
Or, if you really want to scrape the <meta> tag, you can do it the same way, though the selector is more painful:
sapply(Data$Pages, function(url){
url %>%
as.character() %>%
read_html() %>%
html_nodes(xpath = '//meta[@itemprop="description"]') %>%
html_attr('content')
})
Either way, you get the same result.
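To get the result into the column the question asks for, something along these lines may help. This is only a sketch: `page_source` is a made-up stand-in for a downloaded page so the example runs offline, and `extract_description` is a hypothetical helper name, not part of rvest. It uses `html_node()` (singular) so that each page yields exactly one value, or `NA` when the tag is missing, and it collapses the line break that appears inside the `content` attribute.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
page_source <- '<html><head>
<meta itemprop="description" content="\'Spam King\'
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages">
</head><body><h1>Example</h1></body></html>'

# Hypothetical helper: one description per parsed page, NA if the tag is absent.
extract_description <- function(page) {
  description <- page %>%
    html_node(xpath = '//meta[@itemprop="description"]') %>%
    html_attr("content")
  # Collapse line breaks and repeated whitespace inside the attribute value.
  trimws(gsub("\\s+", " ", description))
}

extract_description(read_html(page_source))

# Applied to the question's data frame (needs network access):
# Data$Meta_Description <- vapply(
#   as.character(Data$Pages),
#   function(url) extract_description(read_html(url)),
#   character(1), USE.NAMES = FALSE
# )
```

`vapply()` with `character(1)` is used instead of `sapply()` so the column is always a plain character vector with one entry per URL, and `USE.NAMES = FALSE` keeps the URLs from becoming element names.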