Parsing escaped XML in CDATA mixed with invalid HTML

Time: 2013-12-10 11:16:01

Tags: java xml string escaping cdata

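I have the following element in a web service response. As you can see, its content is entity-escaped XML, so the XML parser just treats it as a string and I can't get at the data I need with the usual XSLT and XPath approaches. I need to turn this ugly string back into XML so that I can read it properly.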

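I have tried a search and replace that simply turns every &lt; into < and every &gt; into >. That works fine, but there is a problem: the message.body element can actually contain HTML that is not valid XML. For all I know, it might not even be valid HTML. So if I just replace everything, it may blow up when I try to convert the string back into an XML document.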
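
To make the failure mode concrete, here is a minimal sketch of the naive replacement. The class name `NaiveUnescapeDemo`, its helper methods, and the `<br>` sample body are hypothetical and only serve as illustration; the real payload may of course break in other ways.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NaiveUnescapeDemo {
    // The naive approach: blindly turn &lt; and &gt; back into angle brackets.
    static String naiveUnescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">");
    }

    // Returns true if the string parses as a well-formed XML document.
    static boolean isWellFormed(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Plain text body: the naive replacement yields well-formed XML.
        String ok = "&lt;message.body&gt;Just a test message...&lt;/message.body&gt;";
        // Hypothetical body with an unclosed HTML tag: the result no longer parses.
        String bad = "&lt;message.body&gt;one&lt;br&gt;two&lt;/message.body&gt;";

        System.out.println(isWellFormed(naiveUnescape(ok)));
        System.out.println(isWellFormed(naiveUnescape(bad)));
    }
}
```

The second string fails because `<br>` is never closed, which is exactly the kind of content message.body may carry.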

How can I solve this safely? Is there a good way to do the replacement everywhere except between the opening and closing message.body tags?

<output>&lt;item type="object"&gt;
  &lt;ticket.id type="string"&gt;171&lt;/ticket.id&gt;
  &lt;ticket.title type="string"&gt;SoapUI Test&lt;/ticket.title&gt;
  &lt;ticket.created_at type="string"&gt;2013-12-03 12:50:54&lt;/ticket.created_at&gt;
  &lt;ticket.status type="string"&gt;Open&lt;/ticket.status&gt;
  &lt;updated type="string"&gt;false&lt;/updated&gt;
  &lt;message type="object"&gt;
    &lt;message.id type="string"&gt;520&lt;/message.id&gt;
    &lt;message.created_at type="string"&gt;2013-12-03 12:50:54.000&lt;/message.created_at&gt;
    &lt;message.author type="string"/&gt;
    &lt;message.body type="string"&gt;Just a test message...&lt;/message.body&gt;
  &lt;/message&gt;
  &lt;message type="object"&gt;
    &lt;message.id type="string"&gt;521&lt;/message.id&gt;
    &lt;message.created_at type="string"&gt;2013-12-03 13:58:32.000&lt;/message.created_at&gt;
    &lt;message.author type="string"/&gt;
    &lt;message.body type="string"&gt;Another message!&lt;/message.body&gt;
  &lt;/message&gt;
&lt;/item&gt;
</output>

2 answers:

Answer 0: (score: 0)

This is actually lifted from a project I'm working on.

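    // Parses an XML string and returns its root element, or null if parsing fails.
    private Node stringToNode(String textContent) {
        Element node = null;
        try {
            // Encode explicitly as UTF-8 (java.nio.charset.StandardCharsets) instead of
            // the platform default; the XML declaration is stripped by the caller.
            node = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(textContent.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (SAXException e) {
            logger.error(e.getMessage(), e);
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
        } catch (ParserConfigurationException e) {
            logger.error(e.getMessage(), e);
        }
        return node;
    }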

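This gives you the root element of the document parsed from the string. I use the following to put it back into the original document: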

    if (textContent.contains(XML_HEADER)) {
        textContent = textContent.substring(textContent.indexOf(XML_HEADER) + XML_HEADER.length());
    }
    Node newNode = stringToNode(textContent);
    if (newNode != null) {
        Node importedNode = soapBody.getOwnerDocument().importNode(newNode, true);
        nextChild.setTextContent(null);
        nextChild.appendChild(importedNode);
    }
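
For readers who want to try this outside the project, here is a minimal self-contained sketch of the same parse-and-import idea. The class name `ImportParsedXmlDemo` and its methods are made up for illustration, and the sample response is a shortened version of the one in the question. Note that `getTextContent()` already returns the text with `&lt;` and `&gt;` resolved, so no manual replacement is needed. Like the snippet above, it only works while the embedded text is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ImportParsedXmlDemo {
    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Parses the text content of 'target' as XML and replaces that text
    // with the parsed element, imported into the target's own document.
    static void replaceTextWithXml(Element target) throws Exception {
        String text = target.getTextContent(); // entities are already resolved here
        Element parsedRoot = newBuilder()
                .parse(new InputSource(new StringReader(text)))
                .getDocumentElement();
        Node imported = target.getOwnerDocument().importNode(parsedRoot, true);
        target.setTextContent(null); // removes the old text child
        target.appendChild(imported);
    }

    // Parses a response, expands its root element, and reads ticket.id from the result.
    static String ticketIdOf(String response) throws Exception {
        Document doc = newBuilder().parse(new InputSource(new StringReader(response)));
        replaceTextWithXml(doc.getDocumentElement());
        return doc.getElementsByTagName("ticket.id").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String response = "<output>&lt;item type=\"object\"&gt;"
                + "&lt;ticket.id type=\"string\"&gt;171&lt;/ticket.id&gt;"
                + "&lt;/item&gt;</output>";
        System.out.println(ticketIdOf(response)); // prints 171
    }
}
```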

Answer 1: (score: 0)

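Here is my current solution. You give it an XPath for the nodes that are messed up, plus a set of names of elements that may contain messy HTML and other problems. It works roughly as follows: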

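  1. Pull out the text content of the nodes matched by the XPath.
  2. Run a regex that wraps the content of the problematic child elements in CDATA.
  3. Wrap the text in a temporary element (otherwise it blows up if there are multiple root nodes).
  4. Parse the text back into a DOM.
  5. Add the children of the temporary node in place of the previous text content.

The regex solution in step 2 is probably not foolproof, but I haven't really seen a better one so far. If you have, please let me know!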
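
Steps 2 to 4 can be tried in isolation with a small sketch like the one below. It uses the same regex as the full class further down, but the class name `CDataWrapSketch`, the method names, and the sample message are hypothetical. It shares the regex's limits: it assumes the body does not itself contain `]]>` and that the excluded element is not self-closing.

```java
import java.io.StringReader;
import java.util.UUID;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CDataWrapSketch {
    // Step 2: wrap the content of every <name ...>...</name> element in a CDATA section.
    static String wrapInCData(String text, String elementName) {
        String name = Pattern.quote(elementName);
        String regex = "(?s)(<" + name + "\\b[^>]*>)(.*?)(</" + name + ">)";
        return text.replaceAll(regex, "$1<![CDATA[$2]]>$3");
    }

    // Steps 3 and 4: wrap the text in a temporary root element and parse it.
    static Element parseFragment(String text) throws Exception {
        String root = "tmp_" + UUID.randomUUID();
        String wrapped = "<" + root + ">" + text + "</" + root + ">";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        return doc.getDocumentElement();
    }

    // Convenience for the demo: the text of the first message.body after fixing.
    static String firstBody(String unescaped) throws Exception {
        Element tmp = parseFragment(wrapInCData(unescaped, "message.body"));
        return tmp.getElementsByTagName("message.body").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Not well-formed XML on its own: <br> is unclosed and & is unescaped.
        String text = "<message><message.body type=\"string\">"
                + "Line one<br>line two & more</message.body></message>";
        System.out.println(wrapInCData(text, "message.body"));
        System.out.println(firstBody(text)); // prints Line one<br>line two & more
    }
}
```

Because the body ends up inside a CDATA section, the parser never interprets the stray `<br>` or `&`, and the surrounding elements become real DOM nodes.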

**CDataFixer**

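    import java.util.*;
    import java.util.regex.Pattern; // needed for Pattern.quote below
    import javax.xml.xpath.*;
    import org.w3c.dom.*;

    // Note: XmlException is not shown in this answer; it appears to be the author's own checked exception.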
    
    public class CDataFixer
    {
        private final XmlHelper xml = XmlHelper.getInstance();
    
        public Document fix(Document document, String nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
        {
            return fix(document, xml.newXPath().compile(nodesToFix), excludes);
        }
    
        private Document fix(Document document, XPathExpression nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
        {
            Document wc = xml.copy(document); 
    
            NodeList nodes = (NodeList) nodesToFix.evaluate(wc, XPathConstants.NODESET);
            int nodeCount = nodes.getLength();
    
            for(int n=0; n<nodeCount; n++)
                parse(nodes.item(n), excludes);
    
            return wc;
        }
    
        private void parse(Node node, Set<String> excludes) throws XmlException
        {
            String text = node.getTextContent();
    
            for(String exclude : excludes)
            {
                String regex = String.format("(?s)(<%1$s\\b[^>]*>)(.*?)(</%1$s>)", Pattern.quote(exclude));
                text = text.replaceAll(regex, "$1<![CDATA[$2]]>$3");
            }
    
            String randomNode = "tmp_"+UUID.randomUUID().toString();
    
            text = String.format("<%1$s>%2$s</%1$s>", randomNode, text);
    
            NodeList parsed = xml
                .parse(text)
                .getFirstChild()
                .getChildNodes();
    
            node.setTextContent(null);
            for(int n=0; n<parsed.getLength(); n++)
                node.appendChild(node.getOwnerDocument().importNode(parsed.item(n), true));
        }
    }
    

**XmlHelper**

    import java.io.*;    
    import javax.xml.parsers.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.sax.*;
    import javax.xml.transform.stream.*;
    import javax.xml.xpath.*;    
    import org.w3c.dom.*;
    import org.xml.sax.*;
    
    public final class XmlHelper
    {
        private static final XmlHelper instance = new XmlHelper(); 
        public static XmlHelper getInstance()
        {
            return instance;
        }
    
    
        private final SAXTransformerFactory transformerFactory;
        private final DocumentBuilderFactory documentBuilderFactory;
        private final XPathFactory xpathFactory;
    
        private XmlHelper()
        {
            documentBuilderFactory = DocumentBuilderFactory.newInstance();
            documentBuilderFactory.setNamespaceAware(true);
    
            xpathFactory = XPathFactory.newInstance();
    
            TransformerFactory tf = TransformerFactory.newInstance();
            if (!tf.getFeature(SAXTransformerFactory.FEATURE))
                throw new RuntimeException("Failed to create SAX-compatible TransformerFactory.");
            transformerFactory = (SAXTransformerFactory) tf;
        }
    
        public DocumentBuilder newDocumentBuilder()
        {
            try
            {
                return documentBuilderFactory.newDocumentBuilder();
            }
            catch (ParserConfigurationException e)
            {
                throw new RuntimeException("Failed to create new "+DocumentBuilder.class, e);
            }
        }
    
        public XPath newXPath()
        {
            return xpathFactory.newXPath();
        }
    
        public Transformer newIdentityTransformer(boolean omitXmlDeclaration, boolean indent)
        {
            try
            {
                Transformer transformer = transformerFactory.newTransformer();
                transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitXmlDeclaration ? "yes" : "no");
                return transformer;
            }
            catch (TransformerConfigurationException e)
            {
                throw new RuntimeException("Failed to create Transformer instance: "+e.getMessage(), e);
            }
        }
    
        public Templates newTemplates(String xslt) throws XmlException
        {
            try
            {
                return transformerFactory.newTemplates(new DOMSource(parse(xslt)));
            }
            catch (TransformerConfigurationException e)
            {
                throw new RuntimeException("Failed to create templates: "+e.getMessage(), e);
            }
        }
    
        public Document parse(String xml) throws XmlException
        {
            return parse(new InputSource(new StringReader(xml)));
        }
    
        public Document parse(InputSource xml) throws XmlException
        {
            try
            {
                return newDocumentBuilder().parse(xml);
            }
            catch (SAXException e)
            {
                throw new XmlException("Failed to parse xml: "+e.getMessage(), e);
            }
            catch (IOException e)
            {
                throw new XmlException("Failed to read xml: "+e.getMessage(), e);
            }
        }
    
        public String toString(Node node)
        {
            return toString(node, true, false);
        }
    
        public String toString(Node node, boolean omitXMLDeclaration, boolean indent)
        {
            try
            {
                StringWriter writer = new StringWriter();
    
                newIdentityTransformer(omitXMLDeclaration, indent)
                    .transform(new DOMSource(node), new StreamResult(writer));
    
                return writer.toString();
            }
            catch (TransformerException e)
            {
                throw new RuntimeException("Failed to transform XML into string: " + e.getMessage(), e);
            }
        }
    
        public Document copy(Document document)
        {
            DOMSource source = new DOMSource(document);
            DOMResult result = new DOMResult();
    
            try
            {
                newIdentityTransformer(true, false)
                    .transform(source, result);
                return (Document) result.getNode();
            }
            catch (TransformerException e)
            {
                throw new RuntimeException("Failed to copy XML: " + e.getMessage(), e);
            }
        }
    }