How to get the HTML structure from an XML file

Time: 2018-12-06 12:07:45

Tags: java xml xml-parsing

Suppose the XML file looks like this:

<!DOCTYPE html [
<!ENTITY ldquo "&#x2665;">
]>
<DATA>
<ROW>
        <Id>29855</Id>
        <content><p>Did the summer fly as fast &ldquo;</p>
                  <a href="https://www.ex.com/" target="_blank"></content>
<ROW>
<ROW>
        <Id>11223</Id>
        <content><p>Fly as fast &ldquo;</p>
                  <a href="https://www.ex.com/" target="_blank"></content>
<ROW>
</DATA>

The requirement is to get the "Id" and "content" from the XML. The content should keep the HTML structure it has inside the XML, like this:

<p>Fly as fast &ldquo;</p>
                  <a href="https://www.ex.com/" target="_blank">

I tried, but I am getting the content as plain text, for example: Fly as fast “

This is the code I use to parse the XML:

try {
    File fXmlFile = new File("D:\\customer_connect_posts.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);
    doc.getDocumentElement().normalize();

    System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
    NodeList nList = doc.getElementsByTagName("ROW");
    System.out.println("----------------------------");

    for (int temp = 0; temp < nList.getLength(); temp++) {
        Node nNode = nList.item(temp);
        System.out.println("\nCurrent Element :" + nNode.getNodeName());
        if (nNode.getNodeType() == Node.ELEMENT_NODE) {
            Element eElement = (Element) nNode;
            /*System.out.println("Staff id : "
                               + eElement.getAttribute("Name"));*/
            System.out.println("First Name : "
                               + eElement.getElementsByTagName("Id")
                                 .item(0).getTextContent());
            // getTextContent() returns only the text, without the markup
            System.out.println("Last Name : "
                               + eElement.getElementsByTagName("content")
                                 .item(0).getTextContent());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

The problem is that I am calling the getTextContent() method, which returns only the text. Is there another way to do this? Need help...

2 Answers:

Answer 0: (score: 0)

To get the HTML inside a DOM Node as text, you should serialize it as HTML. You can use Saxon with its default Transformer:

 // TransformerFactoryImpl is assumed here to be Saxon's net.sf.saxon.TransformerFactoryImpl
 Node content = eElement.getElementsByTagName("content").item(0);
 StringWriter sw = new StringWriter();
 Result result = new StreamResult(sw);
 TransformerFactory factory = new TransformerFactoryImpl();
 Transformer proc = factory.newTransformer();
 proc.setOutputProperty(OutputKeys.METHOD, "html");
 // Serialize each child of <content> in turn, so <content> itself is not printed
 for (int i = 0; i < content.getChildNodes().getLength(); i++) {
     proc.transform(new DOMSource(content.getChildNodes().item(i)), result);
 }
 System.out.println("Content:" + sw.toString().trim());

You should see the following output:

 Current Element :ROW
 First Name : 29855
 Content:<p>Did the summer fly as fast</p> <a href="https://www.ex.com/" target="_blank"></a>

 Current Element :ROW
 First Name : 11223
 Content:<p>Fly as fast</p> <a href="https://www.ex.com/" target="_blank"></a>

Also, in your document the <ROW> tag should be closed with </ROW>. The same applies to <a>, although for that one you can use the shorthand self-closing form.
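The serialization idea in this answer can also be tried without Saxon. The following is a minimal sketch, assuming the identity Transformer returned by the JDK's own TransformerFactory.newInstance() is acceptable and that the input is already well-formed (tags closed, no undeclared entities). The class name InnerMarkupDemo, the helper contentMarkup, and the inline sample string are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerMarkupDemo {

    // Hypothetical helper: serializes the children of the first <content>
    // element back to markup instead of flattening them to text.
    static String contentMarkup(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node content = doc.getElementsByTagName("content").item(0);

            // Identity transformer from whichever TransformerFactory is on the
            // classpath (assumption: the JDK default, not necessarily Saxon).
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Serialize each child separately so <content> itself is left out.
            StringWriter out = new StringWriter();
            NodeList children = content.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                transformer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize <content>", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed sample: <ROW> and <a> are closed, unlike in the question.
        String xml = "<DATA><ROW><Id>11223</Id><content>"
                + "<p>Fly as fast</p>"
                + "<a href=\"https://www.ex.com/\" target=\"_blank\"></a>"
                + "</content></ROW></DATA>";
        System.out.println(contentMarkup(xml));
    }
}
```

Exact whitespace and indentation of the result may differ between transformer implementations, so it is safer to compare the output structurally than character by character.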

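Answer 1: (score: 0)

You need to use CDATA or encode the HTML in order to store HTML inside XML; otherwise the HTML elements are interpreted as XML elements. Also, your ROW elements do not appear to be closed. I suggest using CDATA like this: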

<DATA>
    <ROW>
        <Id>29855</Id>
        <content><![CDATA[<p>Did the summer fly as fast &ldquo;</p>
            <a href="https://www.ex.com/" target="_blank">]]>
        </content>
    </ROW>
    <ROW>
        <Id>11223</Id>
        <content><![CDATA[<p>Fly as fast &ldquo;</p>
            <a href="https://www.ex.com/" target="_blank">]]>
        </content>
    </ROW>
</DATA>
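With the CDATA layout above, the original getTextContent() call is enough, because the parser treats everything inside the CDATA section as plain character data. A minimal sketch, assuming the standard JDK DOM parser; the class name CdataContentDemo and the helper firstContent are hypothetical, and the XML is passed as a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CdataContentDemo {

    // Hypothetical helper: returns the raw HTML stored in the first <content>.
    static String firstContent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element content = (Element) doc.getElementsByTagName("content").item(0);
            // CDATA is character data, so the markup comes back verbatim and
            // &ldquo; is not expanded (no DOCTYPE declaration is required).
            return content.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not read <content>", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DATA><ROW><Id>11223</Id>"
                + "<content><![CDATA[<p>Fly as fast &ldquo;</p>\n"
                + "<a href=\"https://www.ex.com/\" target=\"_blank\">]]>\n"
                + "</content></ROW></DATA>";
        System.out.println(firstContent(xml));
    }
}
```

The trade-off is that the HTML is now an opaque string: it does not have to be well-formed (the unclosed <a> is accepted), but it can no longer be navigated as DOM nodes.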