R: parsing incomplete text from a web page (HTML)

Asked: 2016-07-13 09:37:49

Tags: html r xml text-mining rvest

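I am trying to parse the plain text of multiple scientific articles for subsequent text analysis. So far I have been using an R script by Tony Breyal that is based on the RCurl and XML packages. It works for all of my target journals except those published on http://www.sciencedirect.com. When I parse an article from SD (the behaviour is consistent across every SD journal I tested), the text object in R contains only the first part of the document. I am not very familiar with HTML, but I suspect the problem lies in SD's HTML, since the script works everywhere else. I know that some journals are not open access, but I do have access rights, and the problem also occurs with open-access articles (see the example). This is the code from GitHub: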

    htmlToText <- function(input, ...) {
      ###--- PACKAGES ---###
      require(RCurl)
      require(XML)

      ###--- LOCAL FUNCTIONS ---###
      # Determine how to grab HTML for a single input element
      evaluate_input <- function(input) {
        # if input is a .html file
        if (file.exists(input)) {
          char.vec <- readLines(input, warn = FALSE)
          return(paste(char.vec, collapse = ""))
        }

        # if input is HTML text
        if (grepl("</html>", input, fixed = TRUE)) return(input)

        # if input is a URL (probably should use a regex here instead?)
        if (!grepl(" ", input)) {
          # download the CA certificate bundle in case of https problems
          if (!file.exists("cacert.perm")) download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.perm")
          return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
        }

        # return NULL if none of the conditions above apply
        return(NULL)
      }

      # convert HTML to plain text, skipping script, style, noscript and form content
      convert_html_to_text <- function(html) {
        doc <- htmlParse(html, asText = TRUE)
        text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
        return(text)
      }

      # format text vector into one character string
      collapse_text <- function(txt) {
        return(paste(txt, collapse = " "))
      }

      ###--- MAIN ---###
      # STEP 1: Evaluate input
      html.list <- lapply(input, evaluate_input)

      # STEP 2: Extract text from HTML
      text.list <- lapply(html.list, convert_html_to_text)

      # STEP 3: Return text
      text.vector <- sapply(text.list, collapse_text)
      return(text.vector)
    }

Here is my code with an example article:

    target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
    temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Methods section:

  

DNA was extracted using the MasterPure™ Yeast DNA Purification Kit (Epicenter, Madison, Wisconsin, USA) following the manufacturer's instructions.

Any suggestions or ideas?

P.S. I also tried html_text from the rvest package, with the same result.

1 answer:

Answer 0 (score: 1)

You can keep your existing code and just append ?np=y to the end of the URL, but this is a bit more compact:

    library(rvest)
    library(stringi)

    target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"

    pg <- read_html(target)
    pg %>%
      html_nodes(xpath = ".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
      html_text() %>%   # turn the selected text nodes into a character vector
      stri_trim() %>%
      paste0(collapse = " ") %>%
      write(file = "output.txt")
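
If you would rather keep the htmlToText function from the question, the first option comes down to a small change to the URL. Here is a minimal sketch of that step, assuming a hypothetical helper called add_query_param (my own name for illustration, not part of RCurl, XML or rvest). It does not handle URLs that contain a #fragment.

```r
# Hypothetical helper (not from any package): append a query parameter
# to a URL, using "&" when the URL already carries a query string.
add_query_param <- function(url, param = "np=y") {
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  paste0(url, sep, param)
}

target <- add_query_param(
  "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
)
print(target)

# With htmlToText() from the question loaded, the call itself is unchanged:
# temp.text <- htmlToText(target)
```

The download call is left commented out so the snippet runs without network access.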

A sample of the output (the total for that article is > 80K):

 Fungal Ecology Volume 22 , August 2016, Pages 61–72        175394|| Species richness 
 influences wine ecosystem function through a dominant species Primrose J. Boynton a , , , 
 Duncan Greig a , b a  Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany 
 b  The Galton Laboratory, Department of Genetics, Evolution, and Environment, University 
 College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016, 
 Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
 Davey Abstract Increased species richness does not always cause increased ecosystem function. 
 Instead, richness can influence individual species with positive or negative ecosystem effects. 
 We investigated richness and function in fermenting wine, and found that richness indirectly 
 affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae . 
 While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich 
 communities, probably because antagonistic species prevent it from growing. It is also diluted 
 from species-poor communities,
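
To see what the XPath filter does without hitting the site, you can try it on a small inline page. This is only a sketch: the HTML below is made up for illustration (only the centerContent id is borrowed from the code above), and the XPath is a simplified variant of the one used on the real article.

```r
library(rvest)

# Made-up page: one wanted container, one script block, one unwanted sidebar.
html <- '<html><body>
  <div id="centerContent">
    <p>First <b>paragraph</b>.</p>
    <script>var hidden = 1;</script>
    <p>Second paragraph.</p>
  </div>
  <div id="sidebar"><p>Not wanted.</p></div>
</body></html>'

# Simplified variant of the XPath above: every text node under the
# container, except text that sits inside <script> or <style>.
xp <- ".//div[@id='centerContent']//text()[not(ancestor::script)][not(ancestor::style)]"

txt <- read_html(html) %>%
  html_nodes(xpath = xp) %>%
  html_text() %>%
  trimws()

# Drop whitespace-only nodes, then join the pieces with single spaces.
txt <- txt[nzchar(txt)]
result <- paste(txt, collapse = " ")
print(result)
```

Because each text node is trimmed and then joined with a space, punctuation that follows an inline element ends up separated from it, which is why the real output above contains sequences like "Volume 22 , August".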