XML tag parser in R

Time: 2018-07-06 09:34:24

Tags: r xml

How can I extract the text from XML tags using R?

<Primitives page="3">
  <component id="Ad3E0" name="" type="Ad" size="Medium" page="3" page-label="1" section="White Pages Business" name-from-body="true" publication="AT2" issue-date="1772-09-15" publication-title="AT2" source-type="PDF" words="46" total-words="46" depth="0">
    <chunk id="Ad3E0" index="1" type="Ad" page="3" label="1" size="Medium" word-count="46" resolution="200">
      <metadata>
        <field name="iPadAvailable"><value>True</value></field>
      </metadata>
            <Content>
              <Primitive id="Ar3E1P1" top="531" left="98" width="401" height="98" toc-entry-id="3" presentation-index="1" image-extension=".png"><p auto="true">
                <smartTag type="NM">A AA A Alcohol Abuse &amp; Drug 
                Addiction<br />Detox-Rehab Treatment</smartTag> Center 24 
                Hour Helpline<br />Andrsn 780-9000<br /><smartTag type="NM">A 
                AA A Alcohol Alcohol Rehab &amp; Drug Rehab And 24<br />Hour 
                Addiction</smartTag> Helpline Andrsn 639-0167</p>
                </Primitive>
          </Content>
        </chunk>
      </component>
    </Primitives>

I am using the XML library in R and tried this command:

xpathSApply(xmltop[[2]][[1]][[3]][[1]],'//*/Primitive[text()]')

I would like the output to look like this:

[1]
Name: A AA A Alcohol Abuse & Drug Addiction Detox-Rehab Treatment Center 24 Hour
Helpline: Andrsn
Number: 780-9000

[2]
Name: A AA A Alcohol Alcohol Rehab & Drug Rehab And 24Hour Addiction
Helpline: Andrsn
Number: 639-0167

1 answer:

Answer 0 (score: 1)

One possible approach:

library(xml2)
library(stringr)
library(dplyr)

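# read the XML from the string `txt` (defined under "Sample data" below)
doc <- read_xml(txt)

# extract the text of every <Primitive> node
res <- xml_find_all(doc, ".//Primitive") %>% 
  xml_text() %>%
  # replace newlines and runs of two or more whitespace characters with a space
  gsub("\\n|\\s\\s+", " ", .) %>%
  trimws() %>%
  # a phone number followed by whitespace ends a record: mark it with a comma
  gsub("(-\\d+)\\s", "\\1,", .) %>%
  # split on the commas and keep the records of the first <Primitive>
  strsplit(split = ',') %>%
  .[[1]]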

#final result
df <- data.frame(Name     = trimws(gsub("Helpline.*$", "", res)),
                 Helpline = trimws(gsub('^.*Helpline\\s*|\\s*\\d+-\\d+$', '', res)),
                 Number   = trimws(str_extract(res, "\\d+-\\d+")))

df
#                                                                       Name Helpline   Number
#1 A AA A Alcohol Abuse & Drug AddictionDetox-Rehab Treatment Center 24 Hour   Andrsn 780-9000
#2            A AA A Alcohol Alcohol Rehab & Drug Rehab And 24Hour Addiction   Andrsn 639-0167
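
In the output above, "AddictionDetox-Rehab" runs together because xml_text() drops each <br /> element without leaving a separator. One possible workaround, shown here only as a minimal sketch, is to replace every <br /> with a space in the raw string before parsing. The shortened sample and the names `txt_small`, `txt_spaced`, `doc_small`, and `flat` are hypothetical and not part of the original answer; the sketch assumes every line break is written as a self-closing <br /> tag.

```r
library(xml2)

# Hypothetical, shortened sample in the same shape as the question's data
txt_small <- '<Primitive><p><smartTag type="NM">Drug Addiction<br />Detox-Rehab Treatment</smartTag> Center 24 Hour Helpline<br />Andrsn 780-9000</p></Primitive>'

# Assumption: line breaks always appear as <br/> or <br />,
# so a plain regex substitution on the raw string is enough
txt_spaced <- gsub("<br\\s*/>", " ", txt_small)

doc_small <- read_xml(txt_spaced)

# xml_text() now finds a space where each <br /> used to be;
# collapse any remaining whitespace runs and trim the ends
flat <- trimws(gsub("\\s+", " ", xml_text(doc_small)))
flat
```

With the full sample data, the same substitution could be applied to `txt` before the read_xml() call; the rest of the pipeline should then see "Addiction Detox-Rehab" with a space.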


Sample data

txt <- '<Primitives page="3">
          <component id="Ad3E0" name="" type="Ad" size="Medium" page="3" page-label="1" section="White Pages Business" name-from-body="true" publication="AT2" issue-date="1772-09-15" publication-title="AT2" source-type="PDF" words="46" total-words="46" depth="0">
            <chunk id="Ad3E0" index="1" type="Ad" page="3" label="1" size="Medium" word-count="46" resolution="200">
              <metadata>
                <field name="iPadAvailable"><value>True</value></field>
              </metadata>
              <Content>
                <Primitive id="Ar3E1P1" top="531" left="98" width="401" height="98" toc-entry-id="3" presentation-index="1" image-extension=".png">
                  <p auto="true">
                    <smartTag type="NM">
                      A AA A Alcohol Abuse &amp; Drug Addiction<br />Detox-Rehab Treatment
                    </smartTag>
                      Center 24 Hour Helpline<br />Andrsn 780-9000<br />
                    <smartTag type="NM">
                      A AA A Alcohol Alcohol Rehab &amp; Drug Rehab And 24<br />Hour Addiction
                    </smartTag> 
                      Helpline Andrsn 639-0167
                  </p>
                </Primitive>
              </Content>
            </chunk>
          </component>
        </Primitives>'
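
Since the question uses the XML package rather than xml2, the extraction step might be written with that package roughly as follows. This is a sketch under assumptions, not a tested drop-in replacement: the shortened sample and the names `txt_small`, `doc_small`, `vals`, and `flat` are hypothetical. As far as I understand, calling xpathSApply() without a function returns the matched nodes themselves, so xmlValue is passed to get their text.

```r
library(XML)

# Hypothetical, shortened sample; in practice pass the full `txt` string instead
txt_small <- '<Primitives><Content><Primitive id="Ar3E1P1"><p>
  <smartTag type="NM">A AA A Alcohol Abuse &amp; Drug Addiction</smartTag>
  Helpline Andrsn 780-9000</p></Primitive></Content></Primitives>'

# asText = TRUE tells xmlParse() the argument is XML content, not a file name
doc_small <- xmlParse(txt_small, asText = TRUE)

# xmlValue returns the concatenated text of each matched <Primitive> node
vals <- xpathSApply(doc_small, "//Primitive", xmlValue)

# Collapse the newlines and indentation left over from the markup
flat <- trimws(gsub("\\s+", " ", vals))
flat
```

From here the same gsub() and str_extract() steps used in the answer could be applied to `flat` to split out the name, helpline, and number.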