Question

I am parsing a Swedish library catalog using R and the XML package. Using the library's API, I get XML back from a URL that contains my query.

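I would like to use XPath queries to parse each record, but every query I run through the XML package's XPath functions returns an empty list, with the sole exception of "//*". I am no expert in either XML parsing or XPath, but I suspect it has something to do with the XML that the API returns.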

Here is a simple example of a single post in the catalog:

library(XML)

example.url <- "http://libris.kb.se/sru/swepub?version=1.1&operation=searchRetrieve&query=mat:dok&maximumRecords=1&recordSchema=mods"
doc = xmlParse(example.url)

# Title
works <- xmlRoot(doc)[[4]][["record"]][["recordData"]][["mods"]][["titleInfo"]][["title"]][[1]]
doesntwork <- getNodeSet(doc, "//title")

# The only xPath that returns anything
onlythisworks <- getNodeSet(doc, "//*")

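If this has to do with namespaces (as these answers suggest), all I understand is that the data returned by the API seems to define namespaces in its initial tag, and that I should be able to use them, but this does not help me: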

# Namespaces are confusing:
title <- getNodeSet(xmlRoot(doc), "//xsi:title", namespaces = c(xsi = "http://www.w3.org/2001/XMLSchema-instance"))

Here is (again) the example return data I am trying to parse.

Answer 1

You have to use the correct namespace. Try the following:

doesntwork <- getNodeSet(doc, "//mods:title")
#[[1]]
#<title>Horizontal Slot Waveguides for Silicon Photonics Back-End Integration [Elektronisk resurs]</title> 
#
#[[2]]
#<title>TRITA-ICT/MAP AVH, 2014:17                      \
#                           </title> 
#
#attr(,"class")
#[1] "XMLNodeSet"
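
To make the difference reproducible without calling the live API, here is a minimal sketch that uses a hand-written, heavily trimmed stand-in for the SRU response. The XML string and the name sample_doc are assumptions made for illustration, not the real API output. It relies on the fact that XPath matches elements by namespace URI, so an unprefixed name such as title only matches elements in no namespace.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the SRU response (not real API output).
xml_text <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
    xmlns:mods="http://www.loc.gov/mods/v3">
  <numberOfRecords>1</numberOfRecords>
  <records><record><recordData>
    <mods xmlns="http://www.loc.gov/mods/v3">
      <titleInfo><title>Example title</title></titleInfo>
    </mods>
  </recordData></record></records>
</searchRetrieveResponse>'

sample_doc <- xmlParse(xml_text, asText = TRUE)

# Unprefixed name: looks for <title> in *no* namespace, so nothing matches.
no_prefix <- getNodeSet(sample_doc, "//title")

# Prefixed name: the mods prefix is declared on the root element, so
# getNodeSet() can resolve it to the MODS namespace URI.
with_prefix <- getNodeSet(sample_doc, "//mods:title")

length(no_prefix)
sapply(with_prefix, xmlValue)
```

The same two queries can then be run against the document parsed from the real URL.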

By the way, I usually get the namespaces via

nsDefs=xmlNamespaceDefinitions(doc,simplify = TRUE,recursive=TRUE)

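But with your document this throws an error. It complains that there are different URIs for the same namespace prefix. According to this site, that does not seem to be very good coding style.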
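
One possible workaround, sketched here under assumptions, is to inspect the declarations one node at a time instead of recursively, so that the two default namespaces never have to share a single vector. The XML string is a hypothetical stand-in for the real response, and the variable names are made up for this example.

```r
library(XML)

# Hypothetical stand-in for the SRU response (not real API output).
xml_text <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
    xmlns:mods="http://www.loc.gov/mods/v3">
  <records><record><recordData>
    <mods xmlns="http://www.loc.gov/mods/v3">
      <titleInfo><title>Example title</title></titleInfo>
    </mods>
  </recordData></record></records>
</searchRetrieveResponse>'

sample_doc <- xmlParse(xml_text, asText = TRUE)

# Declarations on the root element only (recursive defaults to FALSE).
root_ns <- xmlNamespaceDefinitions(xmlRoot(sample_doc), simplify = TRUE)

# Declarations on the <mods> element only.
mods_node <- getNodeSet(sample_doc, "//m:mods",
                        namespaces = c(m = "http://www.loc.gov/mods/v3"))[[1]]
mods_ns <- xmlNamespaceDefinitions(mods_node, simplify = TRUE)

# The recursive call is wrapped in tryCatch() so the script keeps running
# whether or not it fails on this small sample.
recursive_ns <- tryCatch(
  xmlNamespaceDefinitions(sample_doc, simplify = TRUE, recursive = TRUE),
  error = function(e) conditionMessage(e)
)

root_ns
mods_ns
```

Comparing root_ns and mods_ns shows which URI each element binds as its default namespace.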

Update as per the OP's comment

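I am not an XML expert myself, but here is my take: you define a default namespace via <tag xmlns=URI>. Non-default namespaces have the form <tag xmlns:a=URI>, where a is the corresponding namespace prefix. The problem with your document is that it has two different default namespaces. The first is <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/" ... >. The second is <mods xmlns="http://www.loc.gov/mods/v3" ... >. In addition, the URI of the second default namespace also appears in the first tag as xmlns:mods="http://www.loc.gov/mods/v3" (non-default). This looks rather messy. Now, the <title> tag sits inside the <mods> tag. I think the default namespace defined in <mods> is overridden by the non-default namespace of searchRetrieveResponse (because they have the same URI). So although <mods> and all subsequent tags (such as <title>) appear to have a default namespace, they actually have the xmlns:mods namespace. This does not hold for the tag <numberOfRecords>, however (because it is outside <mods>). You can access this node via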

getNodeSet(doc, "//ns:numberOfRecords",
       namespaces = c(ns="http://www.loc.gov/zing/srw/"))

Here you take the default namespace defined in <searchRetrieveResponse> and give it a name (ns in our example). You can then use that name explicitly as a prefix in your XPath query.
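
The same idea can be sketched for both namespaces at once. The example below is illustrative only: the XML string is a hypothetical stand-in for the real response, and the prefixes srw and m are arbitrary names chosen here. XPath compares namespace URIs, so the prefixes do not need to match the ones used in the document. The last query uses the standard XPath function local-name() as a prefix-free alternative.

```r
library(XML)

# Hypothetical stand-in for the SRU response (not real API output).
xml_text <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
    xmlns:mods="http://www.loc.gov/mods/v3">
  <numberOfRecords>1</numberOfRecords>
  <records><record><recordData>
    <mods xmlns="http://www.loc.gov/mods/v3">
      <titleInfo><title>Example title</title></titleInfo>
    </mods>
  </recordData></record></records>
</searchRetrieveResponse>'

sample_doc <- xmlParse(xml_text, asText = TRUE)

# Arbitrary prefixes bound to the two namespace URIs used in the document.
ns <- c(srw = "http://www.loc.gov/zing/srw/",
        m   = "http://www.loc.gov/mods/v3")

# Element in the SRU (outer default) namespace.
n_records <- xpathSApply(sample_doc, "//srw:numberOfRecords", xmlValue,
                         namespaces = ns)

# Element in the MODS (inner default) namespace.
titles <- xpathSApply(sample_doc, "//m:title", xmlValue, namespaces = ns)

# Prefix-free alternative: match on the local name and ignore the namespace.
any_title <- xpathSApply(sample_doc, "//*[local-name() = 'title']", xmlValue)

n_records
titles
```

Matching on local-name() is convenient for exploration, but it will also pick up same-named elements from other namespaces, so the explicit prefixes are the safer choice for real parsing code.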

Why is "//*" the only XPath query that works when parsing this XML in R using the XML package?
