Extracting the text of all SOAP XML nodes with Java

Time: 2013-09-17 10:53:47

Tags: java xml dom sax

I have the following SOAP XML and want to extract the text content of all of its nodes:

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
    xmlns:m="http://www.example.org/stock">
    <soap:Body>
        <m:GetStockName>
            <m:StockName>ABC</m:StockName>
        </m:GetStockName>
        <!--some comment-->
        <m:GetStockPrice>
            <m:StockPrice>10 \n </m:StockPrice>
            <m:StockPrice>\t20</m:StockPrice>
        </m:GetStockPrice>
    </soap:Body>
</soap:Envelope>

The expected output would be:

'ABC10 \n \t20'

With DOM, I did the following:

public static String parseXmlDom() throws ParserConfigurationException,
        SAXException, IOException, FileNotFoundException {

    // Configure the factory before creating the builder; settings applied
    // after newDocumentBuilder() do not affect a builder that already exists.
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML File
    String xml = IOUtils.toString(new FileInputStream(new File(
            "./files/request2.xml")), "UTF-8");
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML String to DOM
    Document doc = builder.parse(is);
    // Extract nodes text
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}

With SAX:

public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {

    final StringBuilder sb = new StringBuilder();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare Handler
    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Called for every run of character data, including whitespace between elements
            sb.append(ch, start, length);
        }
    };
    // Parse XML
    saxParser.parse("./files/request2.xml", handler);
    return sb.toString();
}

With both methods I get:

'


            ABC



            10 \n 
            \t20


'

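I know I could simply use return sb.toString().replaceAll("\n", "").replaceAll("\t", ""); to get the expected result, but if my XML file is badly formatted, for example with extra spaces, the result will contain those extra spaces too.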
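
To make that limitation concrete, here is a minimal sketch. The class name ReplaceAllDemo and the two sample strings are made up for illustration; they stand in for the text returned by the parsers when the file is indented with tabs versus spaces.

```java
public class ReplaceAllDemo {

    // Same cleanup as in the question: remove line breaks and tabs only.
    static String stripNewlinesAndTabs(String s) {
        return s.replaceAll("\n", "").replaceAll("\t", "");
    }

    public static void main(String[] args) {
        // Hypothetical parser output for a tab-indented file.
        String tabIndented = "\n\t\tABC\n\t";
        // Hypothetical parser output for a space-indented file.
        String spaceIndented = "\n    ABC\n  ";

        // Brackets make any leftover whitespace visible.
        System.out.println("[" + stripNewlinesAndTabs(tabIndented) + "]");   // prints [ABC]
        System.out.println("[" + stripNewlinesAndTabs(spaceIndented) + "]"); // prints [    ABC  ]
    }
}
```

The second line shows the problem: the indentation spaces survive, and removing all spaces as well would also destroy the spaces inside '10 \n '.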

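I also tried this approach of reading the XML as a single line before parsing it with SAX or DOM, but it does not work for my SOAP XML example: where the attributes of soap:Envelope are split across lines, it removes the whitespace between them (just before xmlns:m).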

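How can I read the text content of all nodes in a SOAP XML, whether the file is written on a single line or spread across several lines and badly formatted (ignoring comments)?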
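
One direction that might work, as a rough sketch rather than a confirmed solution: walk the DOM tree and skip every text node that consists only of whitespace, so indentation between elements is dropped while text inside elements is kept verbatim. The class name NonBlankTextExtractor and the methods extractText and collect are hypothetical names for this sketch. It assumes that a whitespace-only text node never carries meaningful data, which would not hold for an element whose content is deliberately blank.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NonBlankTextExtractor {

    // Parses the XML string and concatenates all text nodes that are not whitespace-only.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setIgnoringComments(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            collect(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Depth-first walk: append non-blank text, recurse into elements, ignore everything else.
    static void collect(Node node, StringBuilder sb) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                String text = child.getNodeValue();
                if (!text.trim().isEmpty()) {
                    sb.append(text); // keep the text exactly as written, inner spaces included
                }
            } else if (type == Node.ELEMENT_NODE) {
                collect(child, sb);
            }
        }
    }

    public static void main(String[] args) {
        // Shortened variant of the request, indented with spaces instead of tabs.
        String xml = "<m:GetStockPrice xmlns:m=\"http://www.example.org/stock\">\n"
                + "    <m:StockName>ABC</m:StockName>\n"
                + "    <!--some comment-->\n"
                + "    <m:StockPrice>10 \\n </m:StockPrice>\n"
                + "    <m:StockPrice>\\t20</m:StockPrice>\n"
                + "</m:GetStockPrice>";
        System.out.println("'" + extractText(xml) + "'");
    }
}
```

For the shortened request in main this should print 'ABC10 \n \t20', regardless of whether the file is indented with tabs, spaces, or not at all. The same filter could probably be applied in a SAX handler by buffering the characters() calls and discarding a buffer that is blank.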

0 answers:

No answers yet