Web scraping in R: a tricky span next to an email address

Asked: 2018-08-09 18:08:25

Tags: r web-scraping

I am trying to scrape the email address from the following line of HTML:

<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
  "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>

I don't know why this span is giving me so much trouble.

I am using R, but I'm not sure which selector to use:

email <- html_text(
    html_nodes(doc, ?????))

Here is the scraper I am currently using:

library(rvest)

scrape <- function(x) {
  doc <- read_html(x)

  # Text of every node matching each selector
  author <- html_text(html_nodes(doc, '.art_authors'))
  year <- html_text(html_nodes(doc, '.year'))
  journalName <- html_text(html_nodes(doc, '.journalName'))
  art_title <- html_text(html_nodes(doc, '.art_title'))
  volume <- html_text(html_nodes(doc, '.volume'))
  page <- html_text(html_nodes(doc, '.page'))
  email <- html_text(html_nodes(doc, xpath = "//a[@class = 'email']"))
  email2 <- html_text(html_nodes(doc, xpath = "//a[@class = 'ext-link']"))

  # NA when nothing matched; otherwise ifelse() keeps the first match
  Author <- ifelse(length(author) == 0, NA, author)
  Year <- ifelse(length(year) == 0, NA, year)
  Journal_Name <- ifelse(length(journalName) == 0, NA, journalName)
  Art_Title <- ifelse(length(art_title) == 0, NA, art_title)
  Volume <- ifelse(length(volume) == 0, NA, volume)
  Page <- ifelse(length(page) == 0, NA, page)

  # Several addresses are collapsed into a single " ; "-separated string
  Email <- ifelse(length(email) == 0, NA,
                  ifelse(length(email) == 1, email, paste(email, collapse = " ; ")))
  Email2 <- ifelse(length(email2) == 0, NA,
                   ifelse(length(email2) == 1, email2, paste(email2, collapse = " ; ")))

  # Return a one-row matrix with one column per field
  cbind(Author, Year, Journal_Name, Art_Title, Volume, Page, Email, Email2)
}

2 Answers:

Answer 0 (score: 2)

You can also select the 'a' tag and read its href attribute with rvest::html_attr():

library(rvest)
library(stringr)  # provides str_remove()
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email 
address&#xA; "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')

## > doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')
## [1] "joeschmoe123@goodtimes.com"
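
Note that html_node() returns only the first matching node. If a page may contain several addresses, a variant along the following lines could collect all of them. This is only a sketch: the HTML snippet and the example.com addresses are made up for illustration, and base R's sub() stands in for str_remove() so that stringr is not required.

```r
library(rvest)

# Hypothetical HTML: two mailto links and one ordinary link
html <- '<p><a href="mailto:a@example.com">A</a>
<a href="https://example.com">site</a>
<a href="mailto:b@example.com">B</a></p>'
doc <- read_html(html)

# href attribute of every <a> tag on the page
hrefs <- doc %>% html_nodes("a") %>% html_attr("href")

# Keep only the mailto: links, then strip the prefix
emails <- sub("^mailto:", "", hrefs[grepl("^mailto:", hrefs)])
emails
```

Filtering on the "mailto:" prefix before stripping it keeps ordinary links out of the result.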

Answer 1 (score: 1)

I think the simplest approach is to select the <a> tags whose href attribute starts with "mailto:". Here is how you could do that:

library(xml2)
library(rvest)
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
                   "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

html_nodes(doc, xpath='//a[starts-with(@href,"mailto:")]') %>% html_text()
# [1] "joeschmoe123@goodtimes.com"
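
To fit this into a scraper that expects one value per page, the XPath query could be wrapped in a small helper that returns NA when no address is found and joins several addresses with " ; ". The following is a minimal sketch under those assumptions; extract_emails is a hypothetical name, not part of rvest.

```r
library(rvest)

# Hypothetical helper: NA when the page has no mailto link,
# otherwise all link texts joined with " ; "
extract_emails <- function(doc) {
  nodes <- html_nodes(doc, xpath = '//a[starts-with(@href, "mailto:")]')
  if (length(nodes) == 0) return(NA_character_)
  paste(trimws(html_text(nodes)), collapse = " ; ")
}

doc <- read_html('<p><span>E-mail address:</span><a href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')
found <- extract_emails(doc)

# A page without any mailto link yields NA
missing <- extract_emails(read_html('<p>No contact details here</p>'))
```

Inside a function like scrape(), a call such as Email <- extract_emails(doc) would then replace the class-based lookups, assuming the target pages mark addresses with mailto: links.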