I have a very large string that represents XML. I am trying to extract node data from it as follows:
String textToExtract = "<FnAnno>\r\n" +
" <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" +
" <F_CUSTOM_BYTES/>\r\n" +
" <F_POINTS/>\r\n" +
" <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" +
" </PropDesc>\r\n" +
"</FnAnno>";
String extractedString =textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");
The result is 005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029
To make this more efficient, I would like to use Pattern and Matcher to extract the substring. Here is the code I am struggling with:
Pattern pattern = Pattern.compile("\\bEncoding=.*?\\.*F_TEXT\\b");
Matcher matcher = pattern.matcher(textToExtract);
while (matcher.find()){
extractedString = (matcher.group());
}
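The result of the above is Encoding="unicode">005400680069007, which I then have to truncate again.
How do I get only the data between <F_TEXT Encoding="unicode"> and </F_TEXT>?
I struggled with regular expressions in school and still struggle with them at work :( I guess I need a lot of practice.
Thanks.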
Answer 0 (score: 1)
Don't use regular expressions to parse XML. Use an XML parser.
To "make it more efficient", use SAX, for example like this:
String textToExtract = "<FnAnno>\r\n" +
" <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" +
" <F_CUSTOM_BYTES/>\r\n" +
" <F_POINTS/>\r\n" +
" <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" +
" </PropDesc>\r\n" +
"</FnAnno>";
StringBuilder buf = new StringBuilder();
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(new InputSource(new StringReader(textToExtract)), new DefaultHandler() {
private boolean captureText;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
this.captureText = qName.equals("F_TEXT");
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
this.captureText = false;
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
if (this.captureText)
buf.append(ch, start, length);
}
});
System.out.println(buf.toString());
Output
005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029
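The snippet above omits its imports and the checked exceptions that `newSAXParser()` and `parse()` declare. Below is a minimal self-contained sketch of the same SAX approach. The class name `FTextSaxExtractor`, the method `extractText`, and the shortened sample XML are made up for illustration and are not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is illustrative only.
class FTextSaxExtractor {

    /** Returns the concatenated text of every element whose qualified name equals elementName. */
    static String extractText(String xml, String elementName)
            throws ParserConfigurationException, SAXException, IOException {
        StringBuilder buf = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private boolean captureText;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Start capturing only when the wanted element opens.
                captureText = qName.equals(elementName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                captureText = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append rather than assign.
                if (captureText) {
                    buf.append(ch, start, length);
                }
            }
        });
        return buf.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample document, assumed for the demo.
        String xml = "<FnAnno><PropDesc F_NAME=\"-1-1\"><F_POINTS/>"
                + "<F_TEXT Encoding=\"unicode\">00540068</F_TEXT></PropDesc></FnAnno>";
        System.out.println(extractText(xml, "F_TEXT")); // prints 00540068
    }
}
```

Because the handler appends in `characters`, the sketch still returns the full value when the parser splits a long text node across several callbacks.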
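Answer 1 (score: 1)
If you are always retrieving the data between the same XML tags, you don't need to bother parsing the document into a data structure. You have the right idea. If speed is what you are after, just grab the string between the known tags.
However, your approach wastes some cycles.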
textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");
Let's break it down:
// loops through the array until "=\"unicode\">" is found
int startIndex = textToExtract.indexOf("=\"unicode\">");
// loops through the array again, until "</F_TEXT>" is found
int endIndex = textToExtract.indexOf("</F_TEXT>");
//loop through the array, copying the bytes to a new array to form a new String
String substr = textToExtract.substring(startIndex,endIndex);
//loop through the array to find and replace "=\"unicode\">" with nothing
String data = substr.replaceFirst("=\"unicode\">", "");
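You are looping over the same array many times.
Once you know where the start point is, there is no need to search from the beginning again. Search onward from that start point instead. Then, once you have the start and end of the substring, extracting it is easy.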
// we know what precedes the substring we want
String anchor = "<F_TEXT Encoding=\"unicode\">";
// so we use it to get the start point, looping once, up to that point
int start = textToExtract.indexOf(anchor)+anchor.length();
// we know the end point won't be before the start point, so start where it left off
int end = start;
// count each character from that point until the next XML tag starts
while (textToExtract.charAt(end) != '<') { end++; }
// now we have what we need to simply get the substring
String data = textToExtract.substring(start,end);
This should improve performance by roughly 60%.
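The same idea can also be expressed with the two-argument `String.indexOf(int ch, int fromIndex)`, which resumes the search at a given position and returns -1 instead of throwing when nothing is found. The following is only a sketch: the class name `TagTextExtractor`, the method `extractAfter`, and the choice to return `null` on a miss are assumptions made for illustration.

```java
// Hypothetical helper class; the name is illustrative only.
class TagTextExtractor {

    /**
     * Returns the text between the first occurrence of anchor and the next '<',
     * or null if either one is missing.
     */
    static String extractAfter(String text, String anchor) {
        int anchorPos = text.indexOf(anchor);
        if (anchorPos < 0) {
            return null; // anchor not present
        }
        int start = anchorPos + anchor.length();
        // Resume the search at start instead of rescanning from index 0.
        int end = text.indexOf('<', start);
        if (end < 0) {
            return null; // no tag follows the anchor
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Shortened sample document, assumed for the demo.
        String xml = "<PropDesc><F_TEXT Encoding=\"unicode\">00540068</F_TEXT></PropDesc>";
        System.out.println(extractAfter(xml, "<F_TEXT Encoding=\"unicode\">")); // prints 00540068
    }
}
```

Like any plain string search, this depends on the opening tag appearing exactly as written in the anchor (same attribute, same quoting, same spacing), and it does not decode entities or handle CDATA sections.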
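Edit: for completeness, let's deal with the regex
Regex is amazing and a lot of fun in scripts, but it is very inefficient for something like this. If you can avoid regex, do so. I tend to use it only as a "quick and dirty" tool: quick in terms of coding time, not execution time. Read up on how regex engines work. It is genuinely interesting, but you will see why it is a last resort.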
/* This pattern looks for the opening XML tag and then matches [^<]+
** [...] matches a single character that is one of the characters inside the "character class."
** [^...] matches a single character that is NOT one of the characters inside the character class.
** [^<]+ matches as many characters as it can that are not '<'
** Putting this expression inside parentheses tells the engine to capture it so it can be referenced later.
** '<' at the end just ensures we capture up until that point.
*/
// create the pattern
Pattern pattern = Pattern.compile("<F_TEXT Encoding=\"unicode\">([^<]+)<");
// get a matcher for it
Matcher matcher = pattern.matcher(textToExtract);
// if we find a match
if (matcher.find()) {
// we can use group(1) to refer to our first capture group
// group(0) will always return the full string matched, but we don't want the tags.
String data= matcher.group(1);
}
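If the regex route is used repeatedly, compiling the pattern once and reusing it avoids paying the compilation cost on every call. Here is a minimal sketch under that assumption. The class name `FTextRegexExtractor` and the method `extract` are hypothetical, and the character class `[^<]` is a deliberate choice here so the capture group stops at the closing tag without backtracking.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class FTextRegexExtractor {

    // Compiled once and reused; Pattern instances are immutable and safe to share between threads.
    private static final Pattern F_TEXT =
            Pattern.compile("<F_TEXT Encoding=\"unicode\">([^<]*)</F_TEXT>");

    /** Returns the element text, or null when the pattern does not match. */
    static String extract(String text) {
        Matcher matcher = F_TEXT.matcher(text);
        // group(1) is the capture group, i.e. the text without the surrounding tags.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened sample document, assumed for the demo.
        String xml = "<PropDesc><F_TEXT Encoding=\"unicode\">00540068</F_TEXT></PropDesc>";
        System.out.println(extract(xml)); // prints 00540068
    }
}
```

Returning the value from a method also sidesteps the scoping issue of declaring the result inside the `if` block, where it is not visible to the code that follows.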