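I need to use R to select specific text from an XML document. The syntax before and after the region I need to pull out is always the same, so once my script handles one file it should work across many XML files.
For example, using a mock XML document: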
<head>
<image name="test1">
<nodes>
<alt>Synthesis1</alt>
</node>
<body> There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section
</body>
<body> Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards
</body>
</image>
</head>
I have been using the XML package in R without any luck. Any suggestions? Thanks!
Answer 0 (score: 0)
Try
library(XML)
doc <- htmlParse('<head>
<image name="test1">
<nodes>
<alt>Synthesis1</alt>
</node>
<body> There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section
</body>
<body> Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards
</body>
</image>
</head>')
doc["//body"]
or
sapply(doc["//body"], xmlValue, trim = TRUE)
# [1] "There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section"
# [2] "Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards"
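The calls above return the full text of each body element, not just the part you want. As a rough sketch of the remaining step: assuming the constant markers around the target are the first and second commas (that is only a guess from the mock document, so swap in your real markers), a base R regular expression can cut the target out. The names `texts`, `pattern`, and `selected` are illustrative, and `texts` simply repeats the output shown above so the snippet runs on its own.

```r
# Body text as returned by sapply(doc["//body"], xmlValue, trim = TRUE) above.
texts <- c(
  "There is a lot of text in this section, THIS IS WHAT I NEED TO SELECT, Here is some more text in the section",
  "Here is the next section, THIS IS AGAIN WHAT I NEED TO SELECT, Here is more text afterwards"
)

# Assumption: the target sits between the first and the second comma.
# Replace this pattern with the real, constant markers from your files.
pattern <- "^[^,]*,\\s*([^,]*),.*$"

# Keep only the captured group, i.e. the text between the two markers.
selected <- sub(pattern, "\\1", texts)
selected
```

If the real markers are fixed strings rather than commas, the same idea should work with those strings placed on either side of the capture group.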