Extracting content inside HTML tags using R

Date: 2015-10-03 10:05:28

Tags: html regex r

I am trying to extract the content between specific HTML tags, for example:

<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
...
</dl>


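I want to extract the text inside the <h2> </h2> tags and inside the <dd></dd> tags. I searched Stack Overflow for similar questions but still could not figure it out. Is there a simple way to do this in R?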

4 answers:

Answer 0 (score: 3)

This creates a two-column matrix m whose first column holds the h2 values and whose second column holds the matching dd values. Since the question gives no information about the form of the input, we assume it is a character string Lines; if it is not, the htmlTreeParse line can be changed accordingly. See ?htmlTreeParse for details.

library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# For each h2, step up to its parent dt and take the first dd that follows it.
# (A path starting with "//" searches the whole document, pairing every h2 with every dd.)
f <- function(x) cbind(h2 = xmlValue(x), dd = xpathSApply(x, "../following-sibling::dd[1]", xmlValue))
L <- xpathApply(doc, "//h2", f)
m <- do.call(rbind, L)

Here we display the h2 column and the first 10 characters of the dd column:

> cbind(h2 = m[,1], dd = substr(m[,2], 1, 10))

      h2                   dd          
 [1,] "ADB"                "Allgemeine"
 [2,] "AMS"                "American m"
 [3,] "Abbott, C. C. 1861" "Abbott, Ch"

This is the input used above:

Lines <- '<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
</dl>'
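
If the HTML lives in a file rather than in a character string, the parse call changes only slightly. The following is a minimal sketch, assuming the XML package is installed; the temporary file and its shortened contents are hypothetical stand-ins for a saved copy of the page:

```r
library(XML)

# Hypothetical stand-in for a saved page: write a one-entry sample to a temp file
path <- tempfile(fileext = ".html")
writeLines(
  '<dl><dt><h2><a name="ADB">ADB</a></h2></dt><dd>Allgemeine deutsche Biographie.</dd></dl>',
  path
)

# With a file path, omit asText = TRUE so the argument is treated as a file name
doc <- htmlTreeParse(path, useInternalNodes = TRUE)

h2_vals <- xpathSApply(doc, "//h2", xmlValue)
dd_vals <- xpathSApply(doc, "//dd", xmlValue)
print(h2_vals)
print(dd_vals)
```

The XPath queries are the same as for string input; only the way the document is read differs.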

Answer 1 (score: 2)

Alternatively, do the scraping the right way:

library(xml2)
library(rvest)

pg <- read_html("https://www.darwinproject.ac.uk/bibliography")

h2 <- html_text(html_nodes(pg, "dt > h2"))
head(h2)
## [1] "ADB"                            "AMS"                           
## [3] "Abbott, C. C. 1861"             "Abich, O. H. W. 1841"          
## [5] "Accum, Frederick. 1820"         "Acevedo Moraga, Fernando. 1987"

dd <- html_text(html_nodes(pg, "dd"))
head(dd)
## [1] "Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912."                                                                
## [2] "American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27."                                                                                                                                                 
## [3] "Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67."                                                                                                                                       
## [4] "Abich, Otto Hermann Wilhelm. 1841. Geologische Betrachtungen über die vulkanischen Erscheinungen und Bildungen in Unter- und Mittel-Italien. Braunschweig."                                                                       
## [5] "Accum, Frederick. 1820. A treatise on the art of brewing, exhibiting the London practice of brewing porter, brown stout, ale, table beer, and various other kinds of malt liquors. London: Longman, Hurst, Rees, Orme, and Brown."
## [6] "Acevedo Moraga, Fernando. 1987. La Escuela de Minas de la Serena. In La Serena University, edited by Claudo Canut de Bon: 1–18. Chile."
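
The two vectors can then be combined into one table. This is a minimal sketch rather than part of the original answer; it assumes every dt has exactly one dd, so that the vectors line up by position, and it uses a hypothetical two-entry sample in place of the live page:

```r
library(xml2)
library(rvest)

# Hypothetical sample: two entries shaped like the HTML in the question
html <- '<dl>
<dt><h2><a name="ADB">ADB</a></h2></dt>
<dd>Allgemeine deutsche Biographie.</dd>
<dt><h2><a name="AMS">AMS</a></h2></dt>
<dd>American men of science.</dd>
</dl>'

pg <- read_html(html)
h2 <- html_text(html_nodes(pg, "dt > h2"))
dd <- html_text(html_nodes(pg, "dd"))

# Positional pairing is only valid if the two vectors line up one-to-one
stopifnot(length(h2) == length(dd))
bib <- data.frame(h2 = h2, dd = dd, stringsAsFactors = FALSE)
print(bib)
```

The length check guards against silently misaligned rows if some entry has no description or more than one.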

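I feel compelled to add a snippet from their ToS:

You may access, download and print extracts from the material on this site for your personal and non-commercial use, in accordance with statutory exemptions, and you may draw the attention of others within your organisation to material posted on the site. You may not:

  • use any part of the materials on this site for direct or indirect commercial purposes or advantage without obtaining a licence to do so from the University or its licensors
  • modify or alter the paper or digital copies of any materials printed off or downloaded in any way
  • sell, resell, license, transfer, transmit, display, perform, hire, lease or loan any content in whole or in part printed or downloaded from the site
  • systematically extract and/or re-utilise substantial parts of the contents or materials on the site
  • create and/or publish your own database that features substantial parts of this site.

If you print off, copy, download or use any part of this site in breach of these terms of use, your right to use this site will cease immediately and you must, at the University's option, return or destroy any copies of the materials you have made.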

Answer 2 (score: 0)

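# Matches any opening, closing, or self-closing HTML tag, including its attributes
htmlpattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
# Replace each tag with the empty string (a "\\1" replacement would put the attribute text back);
# perl = TRUE so that (?:...), lazy quantifiers and \\s inside [] are interpreted as intended
plain.text <- gsub(htmlpattern, "", txt, perl = TRUE)
cat(plain.text)

Note: txt is the HTML text. This strips every tag and keeps the text of all elements, not only h2 and dd.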

Answer 3 (score: -1)

You can use a regular expression; the following search pattern matches the data:

/\<dd\>(.*?)\</dd\>|\<h2\>(.*?)\</h2\>/g
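
The pattern above is written in the /.../g notation of a regex tester, not as R code. One way to apply the same idea in R is sketched below, assuming a PCRE pattern via perl = TRUE; the helper name extract_between and the shortened sample string are hypothetical. Because the h2 elements in the question wrap their text in an a tag, the sketch also strips any nested tags from each match:

```r
# Hypothetical sample input, shortened from the HTML in the question
txt <- '<dl><dt><h2><a name="ADB">ADB</a></h2></dt><dd>Allgemeine deutsche Biographie.</dd></dl>'

# Hypothetical helper: return the text between every <tag ...> and </tag> pair
extract_between <- function(x, tag) {
  # (?s) lets . match newlines; \\b keeps "dd" from matching a longer tag name
  pattern <- sprintf("(?s)<%s\\b[^>]*>(.*?)</%s>", tag, tag)
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Keep only the captured inner content of each match
  inner <- sub(pattern, "\\1", hits, perl = TRUE)
  # Drop any nested tags such as <a ...> and </a>
  gsub("<[^>]+>", "", inner)
}

h2_text <- extract_between(txt, "h2")
dd_text <- extract_between(txt, "dd")
print(h2_text)
print(dd_text)
```

Regular expressions like this are fragile on real-world HTML (nested elements of the same name, unusual attribute quoting), and entities such as &amp;amp; are left undecoded, so a parser-based approach is generally more robust.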
