How to extract text from XML tags using R
<Primitives page="3">
<component id="Ad3E0" name="" type="Ad" size="Medium" page="3" page-label="1" section="White Pages Business" name-from-body="true" publication="AT2" issue-date="1772-09-15" publication-title="AT2" source-type="PDF" words="46" total-words="46" depth="0">
<chunk id="Ad3E0" index="1" type="Ad" page="3" label="1" size="Medium" word-count="46" resolution="200">
<metadata>
<field name="iPadAvailable"><value>True</value></field>
</metadata>
<Content>
<Primitive id="Ar3E1P1" top="531" left="98" width="401" height="98" toc-entry-id="3" presentation-index="1" image-extension=".png"><p auto="true">
<smartTag type="NM">A AA A Alcohol Abuse &amp; Drug
Addiction<br />Detox-Rehab Treatment</smartTag> Center 24
Hour Helpline<br />Andrsn 780-9000<br /><smartTag type="NM">A
AA A Alcohol Alcohol Rehab &amp; Drug Rehab And 24<br />Hour
Addiction</smartTag> Helpline Andrsn 639-0167</p>
</Primitive>
</Content>
</chunk>
</component>
</Primitives>
I am using the XML library in R and tried this command:
xpathSApply(xmltop[[2]][[1]][[3]][[1]],'//*/Primitive[text()]')
I would like the output to look like this:
[1]
Name :A AA A Alcohol Abuse & Drug Addiction Detox-Rehab Treatment Center 24 Hour Helpline: Andrsn
Number: 780-9000
[2]
Name :AA A Alcohol Alcohol Rehab & Drug Rehab And 24Hour Addiction
Helpline : Andrsn
Number: 639-0167
Answer 0 (score: 1)
One possible approach:
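library(xml2)
library(stringr)
library(dplyr)   # used here only for the %>% pipe

# read the XML (txt is defined in the sample data at the end of this answer)
doc <- read_xml(txt)

# extract the text of the Primitive node
res <- xml_find_all(doc, ".//Primitive") %>%
  xml_text() %>%
  # collapse newlines and runs of whitespace into single spaces
  gsub("\\n|\\s\\s+", " ", .) %>%
  trimws() %>%
  # replace the whitespace after each phone number with a comma, then split on it
  gsub("(-\\d+)\\s", "\\1,", .) %>%
  strsplit(split = ',') %>%
  .[[1]]

# final result: one row per ad entry
df <- data.frame(Name = trimws(gsub("Helpline.*$", "", res)),
                 Helpline = trimws(gsub('^.*Helpline\\s*|\\s*\\d+-\\d+$', '', res)),
                 Number = trimws(str_extract(res, "\\d+-\\d+")))
df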
# Name Helpline Number
#1 A AA A Alcohol Abuse & Drug AddictionDetox-Rehab Treatment Center 24 Hour Andrsn 780-9000
#2 A AA A Alcohol Alcohol Rehab & Drug Rehab And 24Hour Addiction Andrsn 639-0167
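
In the output above, text on either side of a `<br />` is fused ("AddictionDetox-Rehab", "24Hour") because `xml_text()` on the whole node concatenates its text nodes without a separator. A minimal sketch of one way around this, assuming every phone number has the form `ddd-dddd` and every entry contains the word "Helpline" (both assumptions taken from the sample only), is to select the text nodes individually and join them with spaces. The shortened `txt` and the names `pieces`, `flat`, `entries`, and `ads` are illustrative, not part of the original answer.

```r
library(xml2)

# Shortened copy of the sample ad, wrapped in <Content>;
# "&" is written as "&amp;" so that the XML parses.
txt <- '<Content>
<Primitive id="Ar3E1P1">
<p auto="true">
<smartTag type="NM">A AA A Alcohol Abuse &amp; Drug Addiction<br />Detox-Rehab Treatment</smartTag>
Center 24 Hour Helpline<br />Andrsn 780-9000<br />
<smartTag type="NM">A AA A Alcohol Alcohol Rehab &amp; Drug Rehab And 24<br />Hour Addiction</smartTag>
Helpline Andrsn 639-0167
</p>
</Primitive>
</Content>'

doc <- read_xml(txt)

# Select each text node separately, so text on either side of <br /> stays apart.
pieces <- xml_text(xml_find_all(doc, ".//Primitive//text()"))
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]   # drop whitespace-only nodes

# Join the pieces with single spaces and normalise internal whitespace.
flat <- gsub("\\s+", " ", paste(pieces, collapse = " "))

# Split after each phone number (assumed format: 3 digits, hyphen, 4 digits).
entries <- strsplit(flat, "(?<=\\d{3}-\\d{4})\\s+", perl = TRUE)[[1]]

# Capture name, helpline label, and number from each entry.
pattern <- "^(.*?)\\s+Helpline\\s+(.*?)\\s+(\\d{3}-\\d{4})$"
m <- regmatches(entries, regexec(pattern, entries, perl = TRUE))
ads <- data.frame(
  Name     = vapply(m, `[`, character(1), 2),
  Helpline = vapply(m, `[`, character(1), 3),
  Number   = vapply(m, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
print(ads)
```

With this variant the names read "Addiction Detox-Rehab" and "24 Hour Addiction". Entries that lack "Helpline" or use a different number format would not match the pattern, so the regular expressions would need adjusting for real data.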
Sample data
txt <- '<Primitives page="3">
<component id="Ad3E0" name="" type="Ad" size="Medium" page="3" page-label="1" section="White Pages Business" name-from-body="true" publication="AT2" issue-date="1772-09-15" publication-title="AT2" source-type="PDF" words="46" total-words="46" depth="0">
<chunk id="Ad3E0" index="1" type="Ad" page="3" label="1" size="Medium" word-count="46" resolution="200">
<metadata>
<field name="iPadAvailable"><value>True</value></field>
</metadata>
<Content>
<Primitive id="Ar3E1P1" top="531" left="98" width="401" height="98" toc-entry-id="3" presentation-index="1" image-extension=".png">
<p auto="true">
<smartTag type="NM">
A AA A Alcohol Abuse &amp; Drug Addiction<br />Detox-Rehab Treatment
</smartTag>
Center 24 Hour Helpline<br />Andrsn 780-9000<br />
<smartTag type="NM">
A AA A Alcohol Alcohol Rehab &amp; Drug Rehab And 24<br />Hour Addiction
</smartTag>
Helpline Andrsn 639-0167
</p>
</Primitive>
</Content>
</chunk>
</component>
</Primitives>'
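
For readers who prefer to stay with the XML package used in the question, a roughly equivalent extraction step might look like the sketch below. It uses a deliberately tiny, hypothetical `txt` rather than the full sample, and only shows how to pull the text nodes under `Primitive`; the clean-up and splitting would proceed as in the answer.

```r
library(XML)

# Tiny illustrative document (not the full sample data).
txt <- '<Content><Primitive id="Ar3E1P1"><p auto="true">Center 24 Hour Helpline<br />Andrsn 780-9000</p></Primitive></Content>'

# Parse from a string rather than a file.
doc <- xmlParse(txt, asText = TRUE)

# Apply xmlValue to every text node below any Primitive element.
pieces <- xpathSApply(doc, "//Primitive//text()", xmlValue)
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]   # drop whitespace-only nodes

# Join with spaces so text around <br /> is not fused.
flat <- paste(pieces, collapse = " ")
print(flat)
```

The XPath in the question, `//*/Primitive[text()]`, selects `Primitive` elements that have a direct text child and returns the nodes themselves, not their text, which is why a function such as `xmlValue` is passed here.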