I have the following SOAP XML, and I want to extract the text content of all its nodes:
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
xmlns:m="http://www.example.org/stock">
<soap:Body>
<m:GetStockName>
<m:StockName>ABC</m:StockName>
</m:GetStockName>
<!--some comment-->
<m:GetStockPrice>
<m:StockPrice>10 \n </m:StockPrice>
<m:StockPrice>\t20</m:StockPrice>
</m:GetStockPrice>
</soap:Body>
</soap:Envelope>
The expected output would be:
'ABC10 \n \t20'
I did the following with DOM:
public static String parseXmlDom() throws ParserConfigurationException,
        SAXException, IOException, FileNotFoundException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Configure the factory before creating the builder;
    // settings applied afterwards do not affect an existing builder
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML File
    String xml = IOUtils.toString(new FileInputStream(new File(
            "./files/request2.xml")), "UTF-8");
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML String to DOM
    Document doc = builder.parse(is);
    // Extract nodes text
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}
And with SAX:
public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {
    final StringBuilder sb = new StringBuilder();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare Handler
    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(ch, start, length);
        }
    };
    // Parse XML
    saxParser.parse("./files/request2.xml", handler);
    return sb.toString();
}
With both methods I get:
'
ABC
10 \n
\t20
'
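I know I could simply use return sb.toString().replaceAll("\n", "").replaceAll("\t", "");
to get the expected result, but if my XML file is badly formatted, for example with extra spaces, the result will contain those extra spaces as well.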
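To make the limitation concrete, here is a small sketch. The class name StripDemo and the two sample strings are made up for illustration; they only imitate what getTextContent() or the SAX handler would return for an unindented and an indented version of the file. Stripping real newlines and tabs works for the first, but leaves the indentation spaces in the second:

```java
class StripDemo {

    // Same chained replaceAll as above: removes real newline and tab characters only.
    static String strip(String s) {
        return s.replaceAll("\n", "").replaceAll("\t", "");
    }

    public static void main(String[] args) {
        // "\\n" and "\\t" are the literal backslash sequences from the XML text,
        // "\n" is a real line break between elements.
        String unindented = "\nABC\n10 \\n \n\\t20\n";
        String indented = "\n  ABC\n  10 \\n \n  \\t20\n";

        System.out.println("'" + strip(unindented) + "'"); // prints 'ABC10 \n \t20'
        System.out.println("'" + strip(indented) + "'");   // prints '  ABC  10 \n   \t20'
    }
}
```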
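Also, I tried this approach of reading the XML as a single line before parsing it with SAX or DOM, but it does not work for my SOAP XML example, because it strips the whitespace between the attributes of soap:Envelope where there is a line break (before xmlns:m).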
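How can I read the text content of all nodes in a SOAP XML, whether the XML file is on a single line or spread over several lines / badly formatted (ignoring comments)?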
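One possible direction, as a rough sketch rather than a definitive answer: walk the DOM tree and append only the text nodes that are not pure whitespace, keeping the others verbatim. The class name SoapTextExtractor and the helpers extractText and collect are hypothetical. The sketch assumes that whitespace-only text between elements is never meaningful, and that the XML is passed in as a String.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SoapTextExtractor {

    // Hypothetical helper: concatenates every text node that is not pure whitespace.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The factory must be configured before the builder is created.
            factory.setNamespaceAware(true);
            factory.setIgnoringComments(true);
            factory.setCoalescing(true); // turn CDATA sections into plain text nodes
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            StringBuilder sb = new StringBuilder();
            collect(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Depth-first walk: keep non-blank text as-is, skip whitespace-only text nodes.
    private static void collect(Node node, StringBuilder sb) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                if (!text.trim().isEmpty()) {
                    sb.append(text); // inner whitespace such as "10 \n " is preserved
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, sb);
            }
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for the SOAP document, indented on purpose.
        String xml = "<a>\n  <b>ABC</b>\n  <!--some comment-->\n  <c>10 \\n </c>\n  <c>\\t20</c>\n</a>";
        System.out.println("'" + extractText(xml) + "'"); // prints 'ABC10 \n \t20'
    }
}
```

Because the parser handles the line break between the soap:Envelope attributes itself, the file does not need to be flattened to a single line first. The same filtering idea could be applied in a SAX handler, but characters() may deliver one text run in several chunks, so the text would have to be buffered per element before deciding whether it is blank.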