Apache Tika: How to use XPath queries

Time: 2015-11-09 15:21:28

Tags: java xml xpath apache-tika

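I am using Apache Tika to parse an XML file. I want to extract certain tags together with their content from the XML and store them in a HashMap. Right now I can extract the full text content of the XML, but the tags are lost: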

    // Handler that collects the text content of the document
    BodyContentHandler handler = new BodyContentHandler();

    Metadata metadata = new Metadata();
    FileInputStream inputstream = null;

    try {
        inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
    } catch (URISyntaxException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    ParseContext pcontext = new ParseContext();

    // XML parser
    XMLParser xmlparser = new XMLParser();
    xmlparser.parse(inputstream, handler, metadata, pcontext);
    System.out.println("Contents of the document:" + handler.toString());
    System.out.println("Metadata of the document:");
    String[] metadataNames = metadata.names();

    for (String name : metadataNames) {
        System.out.println(name + ": " + metadata.get(name));
    }

This shows me the entire text content of the XML.
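The element names are missing from this output because `BodyContentHandler`, when constructed with no arguments, collects only the character data from the SAX events it receives. For comparison, here is a minimal JDK-only sketch (no Tika involved) of a SAX handler that records each element's path together with its text. The class name `TagAwareHandler`, the `collect` helper, and the inlined sample XML are illustrative assumptions, not part of the original code.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps element paths next to their text instead of text alone.
class TagAwareHandler extends DefaultHandler {
    // Inlined copy of the sample document, so the sketch runs on its own.
    static final String SAMPLE =
          "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
        + "<ProductStatus statusType=\"Published\"/>"
        + "<Source><Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
        + "<PublisherID idType=\"Shortname\">jjkjkj</PublisherID>"
        + "<PublisherID idType=\"BM\">6666</PublisherID>"
        + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
        + "<PublisherName nameType=\"Person\">"
        + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
        + "</PublisherName></Publisher></Source></Product>";

    private final Deque<String> path = new ArrayDeque<>();   // open elements, root first
    private final StringBuilder text = new StringBuilder();  // text since the last tag
    private final List<String> entries = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // Record "path/to/element=text" for elements that directly hold text
            entries.add(String.join("/", path) + "=" + value);
        }
        path.removeLast();
        text.setLength(0);
    }

    // Parses the given XML string and returns the recorded path=text entries.
    static List<String> collect(String xml) {
        try {
            TagAwareHandler handler = new TagAwareHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        collect(SAMPLE).forEach(System.out::println);
    }
}
```

This only illustrates that tag names are available at the SAX level. It does not attempt to filter by attribute the way the XPath predicate would.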

Now I want to extract only certain parts of the XML. Since Tika allows XPath queries, I tried this:

    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
    ContentHandler xhandler = new MatchingContentHandler(
            new ToXMLContentHandler(), divContentMatcher);

    AutoDetectParser parser = new AutoDetectParser();
    Metadata xmetadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
        parser.parse(stream, xhandler, xmetadata);
        System.out.println(xhandler.toString());
    } catch (URISyntaxException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

But it does not print any output! I expected it to give me only the node specified in the XPath query.

Any idea what is going on?

By the way, here is the corresponding XML:

    <Product productID="xvc22" shortProductID="x" language="en">
      <ProductStatus statusType="Published" />
      <Source>
        <Publisher sequence="1" primaryIndicator="Yes">
          <PublisherID idType="Shortname">jjkjkj</PublisherID>
          <PublisherID idType="BM">6666</PublisherID>
          <PublisherName nameType="Legal">ABT</PublisherName>
          <PublisherName nameType="Person">
            <LastName>pppp</LastName>
            <FirstName>lkkk</FirstName>
          </PublisherName>
        </Publisher>
      </Source>
    </Product>

Also, when I test the query at http://www.freeformatter.com/xpath-tester.html I see the correct result, namely:

Element='<PublisherName nameType="Person">
  <LastName>pppp</LastName>
  <FirstName>lkkk</FirstName>
</PublisherName>'
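To get a similar element-with-markup result in plain Java, one option is to evaluate the expression as a single node and serialize that node with a `Transformer`. The following is a minimal sketch using only JDK classes. The class name `NodeToXml`, the `matchAsXml` helper, and the inlined sample XML are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the first XPath match as XML markup, tags included.
class NodeToXml {
    // Inlined copy of the sample document, so the sketch runs on its own.
    static final String SAMPLE =
          "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
        + "<ProductStatus statusType=\"Published\"/>"
        + "<Source><Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
        + "<PublisherID idType=\"Shortname\">jjkjkj</PublisherID>"
        + "<PublisherID idType=\"BM\">6666</PublisherID>"
        + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
        + "<PublisherName nameType=\"Person\">"
        + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
        + "</PublisherName></Publisher></Source></Product>";

    // Returns the serialized first match, or null when nothing matches.
    static String matchAsXml(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // NODE yields the first matching node rather than its string value
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
            if (node == null) {
                return null;
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchAsXml(SAMPLE,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

This sidesteps Tika entirely, so it says nothing about why the Tika matcher produces no output.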

Is this some syntax issue with Java or with Tika?

Edit

Note that if I parse without Tika, it works fine:

      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
      XPathFactory xPathfactory = XPathFactory.newInstance();
      XPath xpath = xPathfactory.newXPath();
      XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");

      System.out.println(expr.evaluate(doc, XPathConstants.STRING));

This prints:

pppp
lkkk

This is exactly what I want. So why can't Tika handle the XPath query?
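A possible explanation, offered with some caution: as far as the `XPathParser` Javadoc describes it, Tika implements only a small XPath subset without predicates such as `[@nameType='Person']`, and the matcher sees the XHTML events a Tika parser emits rather than the original XML element names. If that is the cause, the JAXP route above may be the more practical one. Note also that `XPathConstants.STRING` returns the concatenated text of the first match, which is why the tag names are still missing from the printed result. The sketch below evaluates a node set instead and fills a map keyed by element name, which is the stated goal. The class name `PublisherNameMap`, the `extract` helper, the trailing `/*` step, and the inlined sample XML are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: maps each matched element's name to its text content.
class PublisherNameMap {
    // Inlined copy of the sample document, so the sketch runs on its own.
    static final String SAMPLE =
          "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
        + "<ProductStatus statusType=\"Published\"/>"
        + "<Source><Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
        + "<PublisherID idType=\"Shortname\">jjkjkj</PublisherID>"
        + "<PublisherID idType=\"BM\">6666</PublisherID>"
        + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
        + "<PublisherName nameType=\"Person\">"
        + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
        + "</PublisherName></Publisher></Source></Product>";

    static Map<String, String> extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // NODESET returns every match instead of one concatenated string
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                // A later element with the same name overwrites an earlier one
                result.put(node.getNodeName(), node.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // The trailing /* selects the child elements of the matched PublisherName
        System.out.println(extract(SAMPLE,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']/*"));
    }
}
```

A `LinkedHashMap` is used here only to keep document order. A plain `HashMap` would work the same way if order does not matter.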

0 answers:

No answers yet