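I'm trying to get the text between tags, and so far that has worked. But sometimes, when my custom tags contain special characters or HTML tags, I can't get the text. The sample XML looks like this: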
<records>
<car name='HSV Maloo' make='Holden' year='2006'>
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
</car>
<car name='P50' make='Peel' year='1962'>
<ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
</car>
<car name='Royale' make='Bugatti' year='1931'>
<ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
</car>
</records>
The output I get is:
[Australia, Isle of Man, France]
[., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
As you can see, "Accounting Terms" is missing. All I get is a dot. How can I correct this?
SAX parser code:
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.*
class SAXXMLParser extends DefaultHandler {
    def DefinedTermTitles = []
    def ClauseTitles = []
    def currentMessage
    def countryFlag = false

    void startElement(String ns, String localName, String qName, Attributes atts) {
        switch (qName) {
            case 'ae_clauseTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true;
                break
            case 'ae_definedTermTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true;
                break
        }
    }

    void characters(char[] chars, int offset, int length) {
        if (countryFlag) {
            currentMessage = new String(chars, offset, length)
            println(currentMessage)
        }
    }

    void endElement(String ns, String localName, String qName) {
        switch (qName) {
            case 'ae_clauseTitleEnd':
                ClauseTitles.add(currentMessage)
                countryFlag = false;
                break
            case 'ae_definedTermTitleEnd':
                DefinedTermTitles.add(currentMessage)
                countryFlag = false;
                break
        }
    }
}
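Answer 0 (score: 0)
I'm not familiar with Groovy, so here is a solution in Java. I believe translating it is straightforward. The key change is that characters() appends each piece of text to a StringBuilder instead of overwriting currentMessage, and the buffer is reset whenever an End marker is reached.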
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
public class SaxHandler extends DefaultHandler {
    ArrayList<String> DefinedTermTitles = new ArrayList<>();
    ArrayList<String> ClauseTitles = new ArrayList<>();
    String currentMessage;
    boolean countryFlag = false;
    // Accumulates all text seen between a Begin marker and its End marker.
    StringBuilder message = new StringBuilder();

    @Override
    public void startElement(String ns, String localName, String qName, Attributes atts) {
        switch (qName) {
            case "ae_clauseTitleBegin":
                countryFlag = true;
                break;
            case "ae_definedTermTitleBegin":
                countryFlag = true;
                break;
        }
    }

    @Override
    public void characters(char[] chars, int offset, int length) {
        if (countryFlag) {
            // Append rather than assign: the parser may deliver the text in several calls.
            message.append(new String(chars, offset, length));
        }
    }

    @Override
    public void endElement(String ns, String localName, String qName) {
        switch (qName) {
            case "ae_clauseTitleEnd":
                ClauseTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break;
            case "ae_definedTermTitleEnd":
                DefinedTermTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break;
        }
    }

    public static void main(String[] argv) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            String path = "INPUT_PATH_HERE";
            InputStream xmlInput = new FileInputStream(path + "test.xml");
            SAXParser saxParser = factory.newSAXParser();
            SaxHandler handler = new SaxHandler();
            saxParser.parse(xmlInput, handler);
            System.out.println(handler.DefinedTermTitles);
            System.out.println(handler.ClauseTitles);
        } catch (Exception err) {
            err.printStackTrace();
        }
    }
}
Output
[Australia, Isle of Man, France]
[1.02 Accounting Terms., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
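To see why appending matters, it may help to look at what the parser actually passes to characters(). The sketch below is only an illustration: the class name ChunkDemo, the helper collectChunks and the wrapper element <t> are made up for this example and are not part of the question's code. It records every chunk of text in order. Because the <u> element interrupts the clause title, the text arrives in several separate calls, so a handler that assigns on each call keeps only the last chunk (the dot).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answer.
public class ChunkDemo {

    // Returns every chunk of text the parser hands to characters(), in order.
    static List<String> collectChunks(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] chars, int offset, int length) {
                chunks.add(new String(chars, offset, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the clause title in the question: text, nested <u>, more text.
        List<String> chunks = collectChunks("<t>1.02 <u>Accounting Terms</u>.</t>");
        System.out.println(chunks);                  // several chunks, one per run of text
        System.out.println(String.join("", chunks)); // the full text once the chunks are joined
    }
}
```

Note that SAX is allowed to split even a single run of text across multiple characters() calls, so accumulating in a buffer is the safe pattern regardless of nested elements.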
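Answer 1 (score: 0)
Since you have now asked this question for a different library as well, here is a solution using XmlParser. The author of this XML probably did not understand how XML works. If I were you, I would rather do some filtering first to make it sane again (e.g. turn <tagBegin/>X<tagEnd/> into <tag>X</tag>).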
def xml = '''\
<records>
<car name='HSV Maloo' make='Holden' year='2006'>
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
</car>
<car name='P50' make='Peel' year='1962'>
<ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
</car>
<car name='Royale' make='Bugatti' year='1931'>
<ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
</car>
</records>
'''
def underp = { l ->
    l.inject([texts: [:]]) { r, it ->
        if (it.respondsTo('name') && it.name().endsWith('Begin')) {
            r.texts[(r.last=it.name().replaceFirst(/Begin$/,''))] = ''
        } else if (it.respondsTo('name') && it.name().endsWith('End')) {
            r.last = null
        } else if (r.last) {
            r.texts[r.last] += (it instanceof String) ? it : it.text()
        }
        r
    }.texts
}
def root = new XmlParser().parseText(xml)
root.car.each{
    println underp(it.children()).inspect()
}
Prints
['ae_definedTermTitle':'Australia', 'ae_clauseTitle':'1.02 Accounting Terms.']
['ae_definedTermTitle':'Isle of Man', 'ae_clauseTitle':'Smallest Street-Legal Car at 99cm wide and 59 kg in weight']
['ae_definedTermTitle':'France', 'ae_clauseTitle':'Most Valuable Car at $15 million']
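The filtering idea mentioned at the top of this answer (turning <tagBegin/>X<tagEnd/> into <tag>X</tag> before parsing) could look roughly like the sketch below. It is written in Java to match the other answer, and everything in it (the class MarkerFilter, the methods normalize and texts) is hypothetical. It also assumes that every Begin marker has a matching End marker in the same parent and that marker names contain only word characters. Regex rewriting of XML is fragile in general (comments, CDATA sections), so treat this as a starting point rather than a robust solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
public class MarkerFilter {

    // Rewrites <fooBegin />...<fooEnd /> marker pairs into <foo>...</foo>.
    static String normalize(String xml) {
        return xml
                .replaceAll("<(\\w+)Begin\\s*/>", "<$1>")
                .replaceAll("<(\\w+)End\\s*/>", "</$1>");
    }

    // Parses the normalized XML and returns the full text of every <tag> element,
    // including text inside nested elements such as <u>.
    static List<String> texts(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(normalize(xml))));
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<records><car name='HSV Maloo' make='Holden' year='2006'>"
                + "<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />"
                + "<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />"
                + "</car></records>";
        System.out.println(texts(xml, "ae_definedTermTitle")); // prints [Australia]
        System.out.println(texts(xml, "ae_clauseTitle"));      // prints [1.02 Accounting Terms.]
    }
}
```

Once the markers are real elements, any ordinary XML API can read the titles, and nested markup like <u> no longer needs special handling.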