I have an XML file that looks like this:
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19 http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19/pagecontent.xsd" pcGtsId="pc-00530982">
<Metadata>
<Page imageFilename="00530982.tif" imageWidth="3346" imageHeight="5328">
<TextRegion id="r2" readingDirection="left-to-right" type="paragraph">
<Coords>
<Point x="94" y="3372"/>
<Point x="356" y="3375"/>
<Point x="326" y="3375"/>
<Point x="317" y="3369"/>
<Point x="160" y="3368"/>
<Point x="152" y="3368"/></Coords>
<TextEquiv>
<Unicode>Obl. Atl. Gr. W. Spw. 7 pCt. 52½, ⅞, ¾; Debentures Dito 8 pCt.
59½, 60¾, 59½; Obl. St. Paul en Pacific Spw. 7 pCt. 56¼ Nieuwe
Russen 1866 154¾, 155.</Unicode></TextEquiv></TextRegion>
</Page>
What I need to do is extract the x and y coordinates for a preselected set of TextRegion IDs.
First, I tried
x <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "x")))
y <- as.numeric(unlist(sapply(xmlChildren(gt[["Page"]][["TextRegion"]][["Coords"]]), xmlGetAttr, "y")))
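This works, but it only gives me the coordinates of the first TextRegion. I need to be able to get the values for any given ID. How can I do that?
I then tried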
coords <- as.data.frame(unlist(xpathSApply(gt, "//TextRegion[@id ='r2']/Coords/Point", xmlGetAttr, "x")))
but I only get an empty data frame.
What am I missing here?
Answer 0 (score: 1)
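Just add the namespace. The root element declares a default namespace (the xmlns attribute on PcGts), so the unprefixed element names in your XPath expression match nothing. Bind that namespace URI to a prefix and use the prefix on every element name: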
ns <- "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19"
xpathSApply(
  gt, "//x:TextRegion[@id ='r2']/x:Coords/x:Point",
  namespaces = c(x = ns),
  xmlGetAttr, "x")
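To collect both x and y for several preselected IDs at once, the same namespaced XPath can be wrapped in a small function. The following is only a sketch: `get_coords`, `sample_xml`, and the ID `r1` are made-up names for illustration, and the tiny inline document stands in for the real file, which you would load with `xmlParse()` instead.

```r
library(XML)

# Namespace URI taken from the root element of the PAGE file
ns <- c(x = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19")

# Small stand-in document (assumption: the real file has the same structure)
sample_xml <- paste0(
  '<PcGts xmlns="', ns[["x"]], '"><Page>',
  '<TextRegion id="r1"><Coords><Point x="10" y="20"/></Coords></TextRegion>',
  '<TextRegion id="r2"><Coords><Point x="94" y="3372"/>',
  '<Point x="356" y="3375"/></Coords></TextRegion>',
  '</Page></PcGts>'
)
doc <- xmlParse(sample_xml, asText = TRUE)

# Hypothetical helper: one row per Point of the TextRegion with the given id
get_coords <- function(doc, id) {
  path <- sprintf("//x:TextRegion[@id='%s']/x:Coords/x:Point", id)
  pts <- getNodeSet(doc, path, namespaces = ns)
  data.frame(
    id = rep(id, length(pts)),
    x = as.numeric(vapply(pts, xmlGetAttr, character(1), name = "x")),
    y = as.numeric(vapply(pts, xmlGetAttr, character(1), name = "y")),
    stringsAsFactors = FALSE
  )
}

# Stack the per-region data frames for every id of interest
ids <- c("r1", "r2")
coords <- do.call(rbind, lapply(ids, get_coords, doc = doc))
print(coords)
```

An ID that does not occur in the document simply contributes zero rows, so the result only contains regions that were actually found.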