Selecting specific text from an XML file using R

Time: 2014-06-02 19:21:23

Tags: xml r parsing

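I need to select specific text from an XML document using R. The syntax before and after the region I need to pull out is invariant, so the same script should work across many XML files.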

For example, using a mock XML document:

<head>
  <image name="test1">
    <nodes>
      <alt>Synthesis1</alt>
    </nodes>
    <body> There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section
    </body>
    <body> Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards
    </body>
  </image>
</head>

I have been using the XML package in R with no luck. Any suggestions? Thanks!

1 Answer:

Answer 0 (score: 0)

Try

library(XML)
doc <- htmlParse('<head>
  <image name="test1">
    <nodes>
      <alt>Synthesis1</alt>
    </nodes>
    <body> There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section
    </body>
    <body> Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards
    </body>
  </image>
</head>')
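# List every <body> node matched by the XPath expression
doc["//body"]

# Extract the trimmed text content of each <body> node
sapply(doc["//body"], xmlValue, trim = TRUE)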
# [1] "There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section"
# [2] "Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards"
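
The code above returns the full text of each body node, while the question asks for only the part between two fixed markers. Here is a minimal base R sketch of that last step. The helper name extract_between is hypothetical, and the marker strings "section, " and ", Here is" are assumptions read off the mock document, so substitute whatever really stays constant in your files. The body text is copied into a vector so the sketch runs without the XML package.

```r
# Body text as returned by xmlValue() in the answer, copied here so the
# sketch is self-contained.
body_text <- c(
  "There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section",
  "Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards"
)

# Hypothetical helper: return the text between the first occurrence of
# `before` and the next occurrence of `after`. Both markers are matched as
# literal strings (fixed = TRUE), not as regular expressions.
extract_between <- function(x, before, after) {
  vapply(x, function(s) {
    start <- as.integer(regexpr(before, s, fixed = TRUE))
    if (start == -1L) return(NA_character_)   # opening marker not found
    rest <- substring(s, start + nchar(before))
    end <- as.integer(regexpr(after, rest, fixed = TRUE))
    if (end == -1L) return(NA_character_)     # closing marker not found
    substr(rest, 1L, end - 1L)
  }, character(1), USE.NAMES = FALSE)
}

# Assumed markers, taken from the mock document only.
selected <- extract_between(body_text, "section, ", ", Here is")
print(selected)
```

In a real script you would pass the result of sapply(doc["//body"], xmlValue, trim = TRUE) instead of body_text. Strings that lack either marker come back as NA, so files that do not follow the expected pattern are easy to spot.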