Scala regular expression stack overflow

Date: 2013-09-14 18:35:39

Tags: java regex scala

Typing the following in Scala (pattern matching against a regexp to find the value of the id field):

val str = """<path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>"""

val Idpattern = """.*id="([^"]*)"(?:[\n\r\t]|.)*""".r

str match {
  case Idpattern(id) => id
  case _ => "no id"
}

produces the following exception trace:

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
...

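How can I overcome this? I could try parsing the XML with a library, but I don't need that convoluted machinery. I thought a regexp would be quick and reliable.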

3 answers:

Answer 0 (score: 5)

Scala actually provides native XML handling. If you remove the """ at the start and end of str, it becomes a NodeSeq that you can manipulate easily, for example:

val str = <path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>

val idAttribute = str \\ "@id"

val id = if (idAttribute.isEmpty) "no id" else idAttribute.text

You can read more here.

Answer 1 (score: 2)

For a task like this, it is better to write a regex that matches only part of the string:

scala> val Idpattern = """id="([^"]*)"""".r
scala> Idpattern.findFirstMatchIn(str).map(_.group(1))
res10: Option[String] = Some(basarbre)

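This way, the regex engine can start by scanning the string for an 'i'. With the original regex, the greedy .* matches the entire string, and the regex engine then has to backtrack from the end. As for why your regex blows the stack, I think it may be a problem with how Java handles the alternation at the end of the regex, but I'm not sure. A shorter regex offers fewer opportunities for recursion.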
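To reproduce the question's "no id" fallback with the partial pattern, the Option returned by findFirstMatchIn can be unwrapped with getOrElse. The following is only a minimal sketch: the names IdLookup and findId are made up for illustration and do not appear in the original code.

```scala
import scala.util.matching.Regex

// Hypothetical helper object; the name is an assumption for this sketch.
object IdLookup {
  // Partial pattern: matches only the attribute itself, not the text around it.
  val IdPattern: Regex = """id="([^"]*)"""".r

  // Returns Some(value) for the first id attribute found, None if there is none.
  def findId(s: String): Option[String] =
    IdPattern.findFirstMatchIn(s).map(_.group(1))
}
```

With this, IdLookup.findId(str).getOrElse("no id") plays the role of the original match expression, including the case where no id attribute is present.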

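Answer 2 (score: 2)

Here is a fix for your regex, in which you are trying to consume line endings. (?s) turns on DOTALL, so the dot matches line terminators as well.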

scala> val Idpattern = """.*id="([^"]*)"(?s).*""".r
Idpattern: scala.util.matching.Regex = .*id="([^"]*)"(?s).*

scala> str match { case Idpattern(id) => id }
res6: String = basarbre
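The effect of (?s) only shows up when the input actually contains a line break. A small sketch of the difference, assuming a hypothetical two-line input (the object and value names are illustrative, not from the question):

```scala
// Hypothetical demo object; all names here are assumptions for this sketch.
object DotAllDemo {
  // Without (?s), the trailing .* stops at the first line terminator,
  // so the pattern cannot cover the whole two-line input.
  val WithoutDotAll = """.*id="([^"]*)".*""".r

  // With (?s) placed before the trailing .*, that dot also matches \n and \r.
  val WithDotAll = """.*id="([^"]*)"(?s).*""".r

  // Made-up input: the id attribute is followed by a line break.
  val multiLine = "<path id=\"basarbre\"\n      d=\"M 0,0\"/>"

  val without: String = multiLine match {
    case WithoutDotAll(id) => id
    case _                 => "no id"
  }

  val withFlag: String = multiLine match {
    case WithDotAll(id) => id
    case _              => "no id"
  }
}
```

Because a Regex used as an extractor has to match the whole input, only the (?s) variant finds the id in this example. Note that the leading .* is still written before the flag, so it does not cross line breaks.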

Here is a better way to find a pattern in Scala:

scala> val Idpattern = """ id="([^"]*)" """.r.unanchored
Idpattern: scala.util.matching.UnanchoredRegex =  id="([^"]*)" 

scala> str match { case Idpattern(id) => id }
res7: String = basarbre
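What unanchored changes is easiest to see by using the same regex source both ways. This is a sketch under assumptions: the object and method names are invented, and the pattern uses a leading \s without the trailing space from the pattern above, so it also matches when id is the last attribute.

```scala
// Hypothetical demo object; names are assumptions for this sketch.
object AnchoringDemo {
  // Leading whitespace, then the id attribute with its value captured.
  private val source = """\sid="([^"]*)""""

  val Anchored   = source.r            // extractor must match the whole input
  val Unanchored = source.r.unanchored // extractor may match anywhere in the input

  def viaAnchored(s: String): String = s match {
    case Anchored(id) => id
    case _            => "no id"
  }

  def viaUnanchored(s: String): String = s match {
    case Unanchored(id) => id
    case _              => "no id"
  }
}
```

The case _ branch also avoids the MatchError that a match with a single case would throw for input that has no id attribute.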