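I'm trying to parse data from an HTML page using Java regex, without much luck. The data is dynamic and often contains zero or more spaces, tabs, and newlines. In addition, the structure of the string I'm parsing can change depending on the number of hits. Here is an example in its cleanest format: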
<div class="center">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>
But it can also look like this:
<div class="center">Showing 2343098 (search took 1.245 seconds)</div>
or
<div class="center">
Showing 125
of 2,343,098
(search took 1.245 seconds)</div>
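What I'm trying to extract is 2,343,098, but because the page is HTML, I have to anchor the match on either "Showing" or "(search" and look in between. The spaces, tabs, and newlines are what trip me up. I've been trying lookaheads and lookbehinds, but no luck so far. Here are some of the patterns I've tried: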
String pattern1 = "Showing [0-9]*\\S"; // not useful
String pattern2 = "[[\\d,+\\.?\\d+]*[\\s*\\n]\\(search took"; //fails
String pattern3 = "(/i)(Showing)(.+?)(\\(search took)"; //fails
String pattern4 = "([\\s\\S]*)\\(search took"; //fails
String pattern5 = "(?s)[\\d].*?(?=\\(search took)"; //close...but fails
Pattern pattern = Pattern.compile(pattern5);
Matcher matcher = pattern.matcher(text); // text = the string I'm parsing
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
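Answer 0 (score: 1)
HTML is not a regular language and cannot be parsed accurately with regular expressions. A regex-based solution is likely to break when the markup changes in the future, whereas a parser-based solution will be more robust.
However, if this is a one-off job, you can use the following regex: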
Showing\s+(?:\d+\s+of\s+)?([\d,.]+)\s+\(search
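As a rough sketch of how that regex might be used from Java: the backslashes have to be doubled inside a string literal, and the total is read from capture group 1. The class name `HitCountExtractor` and the method `extractTotal` are made up for illustration and do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitCountExtractor {
    // Same regex as above, with backslashes doubled for a Java string literal.
    // \s+ matches spaces, tabs, and newlines, so no DOTALL flag is needed.
    private static final Pattern TOTAL = Pattern.compile(
            "Showing\\s+(?:\\d+\\s+of\\s+)?([\\d,.]+)\\s+\\(search");

    // Hypothetical helper: returns the total as written in the page,
    // or null when the pattern does not match.
    static String extractTotal(String text) {
        Matcher m = TOTAL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The three sample layouts from the question.
        String[] samples = {
            "<div class=\"center\">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">Showing 2343098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">\nShowing 125\nof 2,343,098\n(search took 1.245 seconds)</div>"
        };
        for (String sample : samples) {
            System.out.println(extractTotal(sample));
        }
    }
}
```

If a numeric value is needed, something like `Long.parseLong(total.replace(",", ""))` should work, assuming the commas are only thousands separators.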
Answer 1 (score: 0)
A suggested pattern:
"Showing\\s+\\d+\\s+(of\\s+[\\d,.]+\\s+)?\\(search"