I am trying to scrape the email address from the following line of HTML:
<p><span>E-mail address:</span><a title="
 Link to email address

"href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>
I don't know why this span is giving me so much trouble.
I am using R, but I'm not sure how to approach the problem.
email <- html_text(
  html_nodes(doc, ?????))
This is the scraper I am currently using:
library(rvest)

scrape <- function(x) {
  doc <- read_html(x)

  # Pull each field by CSS class (or XPath for the email links)
  author <- html_text(html_nodes(doc, '.art_authors'))
  year <- html_text(html_nodes(doc, '.year'))
  journalName <- html_text(html_nodes(doc, '.journalName'))
  art_title <- html_text(html_nodes(doc, '.art_title'))
  volume <- html_text(html_nodes(doc, '.volume'))
  page <- html_text(html_nodes(doc, '.page'))
  email <- html_text(html_nodes(doc, xpath = "//a[@class = 'email']"))
  email2 <- html_text(html_nodes(doc, xpath = "//a[@class = 'ext-link']"))

  # Fall back to NA when a field is missing from the page
  Author = ifelse(length(author)==0, NA, author)
  Year = ifelse(length(year)==0, NA, year)
  Journal_Name = ifelse(length(journalName)==0, NA, journalName)
  Art_Title = ifelse(length(art_title)==0, NA, art_title)
  Volume = ifelse(length(volume)==0, NA, volume)
  Page = ifelse(length(page)==0, NA, page)
  # Join multiple email addresses into a single string
  Email = ifelse(length(email)==0, NA,
                 ifelse(length(email)==1, email, paste(email, collapse=" ; ")))
  Email2 = ifelse(length(email2)==0, NA,
                  ifelse(length(email2)==1, email2, paste(email2, collapse=" ; ")))

  # Return a one-row matrix of the extracted fields
  row <- cbind(Author, Year, Journal_Name, Art_Title, Volume, Page, Email, Email2)
  row
}
Answer 0: (score: 2)
You can also select the 'a' tag and read its href attribute with rvest::html_attr():
library(rvest)
library(stringr)  # provides str_remove()
doc <- read_html('<p><span>E-mail address:</span><a title="
 Link to email
address
 "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')
doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')
## > doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')
## [1] "joeschmoe123@goodtimes.com"
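If a page contains several links, selecting the first 'a' tag may pick up something that is not an email address. A minimal sketch of one way around that, assuming a CSS attribute selector is acceptable and using base R's sub() in place of str_remove(); the sample HTML below is made up for illustration and is not from the question:

```r
library(rvest)

# Hypothetical sample page: two mailto links and one ordinary link
doc <- read_html('<div>
  <p><a href="mailto:joeschmoe123@goodtimes.com">Joe</a></p>
  <p><a href="mailto:janedoe@goodtimes.com?subject=Hello">Jane</a></p>
  <p><a href="https://example.com">Not an email</a></p>
</div>')

# CSS attribute selector: only <a> tags whose href starts with "mailto:"
links <- html_nodes(doc, 'a[href^="mailto:"]')
hrefs <- html_attr(links, "href")

# Drop the "mailto:" scheme, then any trailing "?subject=..." query part
emails <- sub("\\?.*$", "", sub("^mailto:", "", hrefs))
emails
```

Reading the address from href rather than from the link text also works when the visible text is a name instead of the address itself.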
Answer 1: (score: 1)
I guess the simplest approach is to select the <a> tags whose href= attribute starts with "mailto:". Here is how you would do that:
library(xml2)
library(rvest)
doc <- read_html('<p><span>E-mail address:</span><a title="
 Link to email address

"href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')
html_nodes(doc, xpath='//a[starts-with(@href,"mailto:")]') %>% html_text()
# [1] "joeschmoe123@goodtimes.com"
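To fit this into a scraper like the one in the question, the XPath lookup could be wrapped in a small helper that returns NA when no address is found and joins multiple addresses with " ; ". This is only a sketch: extract_emails is a hypothetical name, and the two test documents are invented for illustration.

```r
library(rvest)

# Hypothetical helper: NA when the page has no mailto link, otherwise all
# addresses joined with " ; " (the separator used in the question's scraper)
extract_emails <- function(doc) {
  links <- html_nodes(doc, xpath = '//a[starts-with(@href, "mailto:")]')
  emails <- trimws(html_text(links))
  if (length(emails) == 0) NA_character_ else paste(emails, collapse = " ; ")
}

# Illustrative inputs: one page with an address, one without
with_email <- read_html('<p><span>E-mail address:</span><a href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')
without_email <- read_html('<p><span>No contact details listed</span></p>')

extract_emails(with_email)
extract_emails(without_email)
```

Inside scrape(), the Email column could then be filled with Email = extract_emails(doc), which avoids depending on a class such as 'email' being present on the link.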