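I am using Apache Tika to parse an XML file. I want to extract certain tags together with their content from the XML and store them in a HashMap. Right now I can extract the entire text content of the XML, but the tags are lost.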
// Set up the content handler and the metadata holder
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try {
    inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
} catch (URISyntaxException e) {
    e.printStackTrace();
}
ParseContext pcontext = new ParseContext();
// Parse with Tika's XML parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}
This shows me the entire text content of the XML.
Now I want to extract only certain parts of the XML. Since Tika allows XPath queries, I tried this:
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
ContentHandler xhandler = new MatchingContentHandler(
        new ToXMLContentHandler(), divContentMatcher);
AutoDetectParser parser = new AutoDetectParser();
Metadata xmetadata = new Metadata();
try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
    parser.parse(stream, xhandler, xmetadata);
    System.out.println(xhandler.toString());
} catch (URISyntaxException e) {
    e.printStackTrace();
}
}
But it does not show any output! I was hoping it would give me only the node specified in the XPath query.
Any idea what is going on?
By the way, here is the corresponding XML:
<Product productID="xvc22" shortProductID="x" language="en">
<ProductStatus statusType="Published" />
<Source>
<Publisher sequence="1" primaryIndicator="Yes">
<PublisherID idType="Shortname">jjkjkj</PublisherID>
<PublisherID idType="BM">6666</PublisherID>
<PublisherName nameType="Legal">ABT</PublisherName>
<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>
</Publisher>
</Source>
</Product>
Also, when I test the query on http://www.freeformatter.com/xpath-tester.html I see the correct result, namely:
Element='<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>'
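For comparison, the same element-with-tags result can be reproduced in plain Java without Tika. The following is only a minimal sketch, assuming the JDK's `javax.xml.xpath` API and an identity `Transformer` for serialization. The class name `NodeToMarkup`, the method `matchAsMarkup`, and the inlined, trimmed copy of the sample XML are illustrative names of my own, not part of the project.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeToMarkup {
    // Trimmed, whitespace-free copy of the sample XML (illustrative only)
    public static final String XML =
            "<Product productID='xvc22'><Source><Publisher sequence='1'>"
          + "<PublisherName nameType='Legal'>ABT</PublisherName>"
          + "<PublisherName nameType='Person'>"
          + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
          + "</PublisherName></Publisher></Source></Product>";

    // Returns the first node matched by the query, serialized with its tags,
    // or an empty string when nothing matches.
    public static String matchAsMarkup(String xml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // NODE (instead of STRING) keeps the element rather than its text value
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODE);
            if (node == null) {
                return "";
            }
            // A Transformer without a stylesheet copies the node as-is
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch only: wrap the checked XML exceptions in an unchecked one
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchAsMarkup(XML,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

Evaluating with `XPathConstants.NODE` rather than `XPathConstants.STRING` is what preserves the markup; the string form returns only the concatenated text of the element.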
Is this some syntax issue with Java or with Tika?
Edit
Note that if I parse without Tika, it works fine:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");
System.out.println(expr.evaluate(doc, XPathConstants.STRING));
This prints
pppp
lkkk
which is exactly what I expect. So why can't Tika handle the XPath query?
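Since the end goal is a HashMap of tags and their content, the plain-JDK route can be extended in that direction while the Tika behaviour is still unclear. This is a rough sketch under stated assumptions, not a Tika solution: the class name `XPathToMap`, the method `childrenAsMap`, and the inlined copy of the sample XML are hypothetical, and appending `/*` to the query to select the child elements of the match is my own choice.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathToMap {
    // Trimmed copy of the sample XML (illustrative only)
    public static final String XML =
            "<Product productID='xvc22'><Source><Publisher sequence='1'>"
          + "<PublisherName nameType='Legal'>ABT</PublisherName>"
          + "<PublisherName nameType='Person'>"
          + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
          + "</PublisherName></Publisher></Source></Product>";

    // Maps each child element name of the matched node(s) to its trimmed text.
    // If two children share a name, the later one overwrites the earlier one.
    public static Map<String, String> childrenAsMap(String xml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "/*" selects the element children of whatever the query matched
            NodeList children = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query + "/*", doc, XPathConstants.NODESET);
            Map<String, String> result = new HashMap<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                result.put(child.getNodeName(), child.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Sketch only: wrap the checked XML exceptions in an unchecked one
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> person = childrenAsMap(XML,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']");
        System.out.println(person.get("LastName") + " " + person.get("FirstName"));
    }
}
```

A `HashMap` does not preserve document order; if the order of the tags matters, a `LinkedHashMap` could be swapped in without other changes.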