Question

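I am trying to parse PubMed Central articles in HTML format. I want to extract each section separately, for example the abstract, introduction, results, and so on.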

Example:

library(XML)
doc.html = htmlTreeParse(paste("http://www.ncbi.nlm.nih.gov/pmc/articles/", 2242602, "/", sep=""), useInternalNodes = TRUE)

doc.html

gives the following:

<div id="sec1" class="tsec sec">
<h2 class="head no_bottom_margin" id="sec1title">1. Introduction</h2>
<p id="__p4" class="p p-first-last">In an effort to extend the structural coverage of proteins for which the biological function is unknown and cannot be deduced by homology, domain of unknown function (DUF) targets were selected from Pfam protein family PF01796 (DUF35). Here, we report the crystal structure of SSO2064, the first structural representative of this family, which was determined using the semiautomated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG; <a href="http://www.jcsg.org" ref="reftype=extlink&amp;article-id=2954200&amp;issue-id=190704&amp;journal-id=381&amp;FROM=Article%7CBody&amp;TO=External%7CLink%7CURI&amp;rendering-type=normal" target="pmc_ext">http://www.jcsg.org</a>; Lesley <em>et al.</em>, 2002<a href="#bb31" rid="bb31" class=" bibr popnode tag_hotlink tag_tooltip" id="__tag_198914221"> ▶</a>) as part of the National Institute of General Medical Sciences (NIGMS) Protein Structure Initiative (PSI). The <em>SSO2064</em> gene of <em>Sulfolobus solfataricus</em>, a hyperthermoacidophilic crenarchaeon (She <em>et al.</em>, 2001<a href="#bb38" rid="bb38" class=" bibr popnode tag_hotlink tag_tooltip" id="__tag_300464760"> ▶</a>), encodes a protein with a molecular weight of 16.5 kDa (residues 1–144) and a calculated isoelectric point of 6.6. Structural analysis of SSO2064 revealed two N-terminal helices followed by a rubredoxin-like zinc ribbon and an oligonucletide/oligosaccharide-binding (OB) fold domain; the genome context and operon organization suggest a role in lipid and polyketide antibiotic biosynthesis.</p>

<div id="sec2" class="tsec sec">
<h2 class="head no_bottom_margin" id="sec2title">2. Materials and methods</h2>
<h3>2.1. Protein production and crystallization</h3>     
<p id="__p5" class="p p-first-last"> Clones were generated using the Polymerase Incomplete Primer</p>

I want to extract the Introduction and the Materials and methods sections separately, using their sec ids.

For sec1 I can do the following:

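# The excerpt above shows section ids such as "sec1", without leading underscores
id_or_class_xp <- "//div[@id='sec1']"
doc = htmlParse(xpathSApply(doc.html, id_or_class_xp, xmlValue), asText=TRUE)
# as.String() comes from the NLP package
fullTextSection <- as.String(xpathSApply(doc, "//p", xmlValue))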

Since I do not know the exact number of sections in an article, is there a way to extract every section by its sec id, for example in a loop?
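
One possible way to avoid hard-coding the number of sections is to select every section div in a single XPath query and then loop over the resulting nodes. The sketch below is only an illustration under stated assumptions: it uses a small inline HTML string as a stand-in for the downloaded page, and it assumes that every top-level section is a div whose id starts with "sec", as in the excerpt. The names html, sections, and section_text are hypothetical.

```r
library(XML)

# Inline stand-in for the PMC page (assumption: each section is a <div>
# whose id starts with "sec", as in the excerpt from the question).
html <- '<html><body>
<div id="sec1" class="tsec sec"><h2 id="sec1title">1. Introduction</h2><p>Intro text.</p></div>
<div id="sec2" class="tsec sec"><h2 id="sec2title">2. Materials and methods</h2><h3>2.1. Protein production</h3><p>Clones were generated.</p><p>Second paragraph.</p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select all section divs at once, however many there are.
sections <- getNodeSet(doc, "//div[starts-with(@id, 'sec')]")

# For each section node, collect its id, its h2 heading, and its paragraph text.
section_text <- lapply(sections, function(node) {
  list(
    id    = xmlGetAttr(node, "id"),
    title = xpathSApply(node, "./h2", xmlValue),
    text  = paste(xpathSApply(node, ".//p", xmlValue), collapse = "\n")
  )
})

# Name the list elements by section id so they can be looked up directly.
names(section_text) <- sapply(section_text, `[[`, "id")
```

With this layout, section_text$sec2$text would hold the paragraphs of the Materials and methods section. If the real page nests subsection divs whose ids also start with "sec", the XPath would match those too and would need to be narrowed, for instance by also testing the class attribute.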

How to parse an HTML article by section id (tags) in R

0 answers: