Reading XML tags from MediaWiki using Java

Time: 2014-09-03 13:57:27

Tags: java xml xpath mediawiki

I need to read the output of the 'search' tag from the following URL using Java.

First, I need to read the XML from the following URL into a string: http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother

I should end up with this:

<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="55180"/>
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
</query>
</api>

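Then, once I have the XML, I need to get the contents of the search tag. The 'search' tag in the output looks like this, and I need two pieces from the markup inside it: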

<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>

In the end, what I need is two strings equal to:

String title = Big Brothers Big Sisters of America
String snippet = "<span class='searchmatch'>Big..."

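Can someone help me fix this code? I'm not sure what I'm doing wrong. I don't think it is even retrieving the XML from the URL, let alone the tags inside the XML.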

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother");
doc.getDocumentElement().normalize();

XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression expr = xpath.compile("//query/search/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}

Sorry, I'm new to this and couldn't find an answer anywhere.

1 answer:

Answer 0: (score: 2)

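The main problem here is that you are asking for the text nodes that are children of <search>, but the <p ..> you actually want is not a text node: it is an element. (In fact, the <search> element has no text node children at all, as you can see by using "View Source" on the response from that URL.)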
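To see the difference for yourself, here is a minimal sketch that counts what each expression selects. It parses an inline string instead of calling the URL; SAMPLE is a trimmed, hypothetical stand-in for the real response, and the class and method names (NodeCountDemo, countNodes) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeCountDemo {
    // Hypothetical, trimmed stand-in for the API response (not fetched from the URL).
    static final String SAMPLE = "<api><query><search>"
        + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\"/>"
        + "</search></query></api>";

    // Counts the nodes that an XPath expression selects in the given XML text.
    static int countNodes(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        // Text-node children of <search>: none in this sample.
        System.out.println(countNodes(SAMPLE, "//query/search/text()"));
        // Element children named p: one in this sample.
        System.out.println(countNodes(SAMPLE, "//query/search/p"));
    }
}
```

With this sample the first expression selects nothing and the second selects one node, which is why the original loop prints nothing.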

So what you need to do is change the XPath expression to

//query/search/p

which will give you the p element nodes. Then, in your Java code, ask each node for the values of its two attributes, title and snippet:

Element e = (Element)(nodes.item(i));
String title = e.getAttribute("title");
String snippet = e.getAttribute("snippet");
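Put together, the element-based approach might look like the sketch below. It is not a drop-in replacement for your program: it parses a hypothetical inline SAMPLE rather than the live URL, and the names SearchAttributes and firstResult are invented here. Note that in well-formed XML the markup inside the snippet attribute has to be escaped (&lt;span ...&gt;); the parser hands it back to you unescaped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SearchAttributes {
    // Hypothetical, shortened sample; the snippet markup is escaped as XML requires.
    static final String SAMPLE =
        "<api><query><searchinfo totalhits=\"55180\"/><search>"
        + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\" "
        + "snippet=\"&lt;span class='searchmatch'&gt;Big&lt;/span&gt; Brothers\"/>"
        + "</search></query></api>";

    // Returns {title, snippet} of the first <p> element, or null if there is none.
    static String[] firstResult(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the p elements, not text nodes.
            NodeList nodes = (NodeList) xpath.evaluate("//query/search/p", doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return null;
            }
            Element e = (Element) nodes.item(0);
            return new String[] { e.getAttribute("title"), e.getAttribute("snippet") };
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse search result", ex);
        }
    }

    public static void main(String[] args) {
        String[] result = firstResult(SAMPLE);
        System.out.println("title = " + result[0]);
        System.out.println("snippet = " + result[1]);
    }
}
```

To run it against the live API you would swap the InputSource for the URL string, as in your original builder.parse(...) call; everything after the parse stays the same.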

Alternatively, you could run two XPath queries, one for each attribute:

//query/search/p/@title

//query/search/p/@snippet
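A rough sketch of the attribute-query variant follows. It relies on the two-argument XPath.evaluate(expression, item), which returns the result as a String (for a node set, the string value of the first node in document order). SAMPLE is again a hypothetical inline stand-in, and AttributeQueries and queryString are names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AttributeQueries {
    // Hypothetical, shortened sample; the snippet markup is escaped as XML requires.
    static final String SAMPLE =
        "<api><query><search>"
        + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\" "
        + "snippet=\"&lt;span class='searchmatch'&gt;Big&lt;/span&gt; Brothers\"/>"
        + "</search></query></api>";

    // Evaluates an XPath expression and returns its string value
    // (an empty string if the expression selects nothing).
    static String queryString(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(queryString(SAMPLE, "//query/search/p/@title"));
        System.out.println(queryString(SAMPLE, "//query/search/p/@snippet"));
    }
}
```

This version re-parses the document for each query to keep the helper small; in real code you would parse once and reuse the Document.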

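This assumes there is only one <p> element. If you do this over multiple <p> elements, you will probably want to keep each pair of attributes together rather than ending up with two separate result lists.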
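One way to keep each pair together is to loop over the p elements and store both attributes in a small holder object. The sketch below assumes a response with more than one result (for example, a larger srlimit); the second <p> in SAMPLE is invented for illustration, as are the names PairedResults, Result and allResults.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PairedResults {
    // Hypothetical sample with two results; the second entry is made up.
    static final String SAMPLE =
        "<api><query><search>"
        + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\" snippet=\"first snippet\"/>"
        + "<p ns=\"0\" title=\"Example second result\" snippet=\"second snippet\"/>"
        + "</search></query></api>";

    // Holds the two attributes of one <p> element together.
    static final class Result {
        final String title;
        final String snippet;

        Result(String title, String snippet) {
            this.title = title;
            this.snippet = snippet;
        }
    }

    // Returns one Result per <p> element, in document order.
    static List<Result> allResults(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//query/search/p", doc, XPathConstants.NODESET);
            List<Result> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                results.add(new Result(e.getAttribute("title"), e.getAttribute("snippet")));
            }
            return results;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse search results", ex);
        }
    }

    public static void main(String[] args) {
        for (Result r : allResults(SAMPLE)) {
            System.out.println(r.title + " -> " + r.snippet);
        }
    }
}
```

Because each title travels with its own snippet, there is no need to match up two lists by index afterwards.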