Parsing XML by attribute in R

Time: 2015-02-24 16:45:46

Tags: xml r parsing

I have an XML file that looks like this:

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19 http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19/pagecontent.xsd" pcGtsId="pc-00530982">
<Metadata>
<Page imageFilename="00530982.tif" imageWidth="3346" imageHeight="5328">
<TextRegion id="r2" readingDirection="left-to-right" type="paragraph">
<Coords>
<Point x="94" y="3372"/>
<Point x="356" y="3375"/>
<Point x="326" y="3375"/>
<Point x="317" y="3369"/>
<Point x="160" y="3368"/>
<Point x="152" y="3368"/></Coords>
<TextEquiv>
<Unicode>Obl. Atl. Gr. W. Spw. 7 pCt. 52½, ⅞, ¾; Debentures Dito 8 pCt.
59½, 60¾, 59½; Obl. St. Paul en Pacific Spw. 7 pCt. 56¼ Nieuwe
Russen 1866 154¾, 155.</Unicode></TextEquiv></TextRegion>
</Page>

What I need to do is extract the x and y coordinates for a preselected set of TextRegion IDs.

First, I tried

x <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "x")))
y <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "y")))

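This works, but it only gives me the coordinates of the first TextRegion. I need to be able to get the values for any given ID. How can I do that?

I also tried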

    coords <- as.data.frame(unlist(xpathSApply(gt, "//TextRegion[@id ='r2']/Coords/Point", xmlGetAttr, "x")))

but all I get is an empty data frame.

What am I missing here?

1 answer:

Answer 0: (score: 1)

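Just add the namespace. The root element declares a default namespace, so every element name in the XPath expression needs a prefix bound to that URI: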

ns <- "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19"

# Bind the prefix "x" to the document's default namespace and use it in the path
xpathSApply(
  gt, "//x:TextRegion[@id ='r2']/x:Coords/x:Point",
  namespaces = c(x = ns),
  xmlGetAttr, "x")
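
The question asks for both x and y across a set of IDs, so here is a minimal sketch that extends the same namespaced query. It is not part of the original answer: the helper name `region_coords`, the second region `r3` and the trimmed inline XML are made up for illustration, and it assumes the `XML` package is installed.

```r
library(XML)

# Trimmed stand-in for the file in the question; region "r3" is invented here
xml_text <- '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="00530982.tif" imageWidth="3346" imageHeight="5328">
    <TextRegion id="r2" type="paragraph">
      <Coords>
        <Point x="94" y="3372"/>
        <Point x="356" y="3375"/>
      </Coords>
    </TextRegion>
    <TextRegion id="r3" type="paragraph">
      <Coords>
        <Point x="10" y="20"/>
      </Coords>
    </TextRegion>
  </Page>
</PcGts>'

gt <- xmlParse(xml_text, asText = TRUE)
ns <- c(x = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19")

# Hypothetical helper: one row per Point of the TextRegion with the given id
region_coords <- function(doc, id, ns) {
  path <- sprintf("//x:TextRegion[@id='%s']/x:Coords/x:Point", id)
  points <- getNodeSet(doc, path, namespaces = ns)
  data.frame(
    id = rep(id, length(points)),
    x = as.numeric(vapply(points, xmlGetAttr, character(1), "x")),
    y = as.numeric(vapply(points, xmlGetAttr, character(1), "y")),
    stringsAsFactors = FALSE
  )
}

# Stack the per-region data frames for the preselected IDs
ids <- c("r2", "r3")
coords <- do.call(rbind, lapply(ids, region_coords, doc = gt, ns = ns))
print(coords)
```

An ID that matches no region contributes zero rows, so a missing ID does not break the `rbind`. For a real file, replace the inline string with `xmlParse("path/to/file.xml")`.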