Scala regex to match lines with special characters

Date: 2015-11-12 10:32:24

Tags: regex scala special-characters

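I have a code snippet that reads lines from a file, and I want to filter out certain lines. Basically, I want to discard everything that does not have three tab-separated columns, where the first column is a number and the other two columns can contain any character except tab and newline (DOS and Unix).

I have checked my regex at http://www.regexr.com/ and it works there.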

scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"

scala> mystr.matches(myreg)
res2: Boolean = false

I found that the problem is related to the special characters. Here is a simpler example:

scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false

scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true

It seems that I cannot use a raw string for my line (see mystr in the first code block). But if I do not use a raw string, Scala complains:

error: invalid escape character

So how can I handle this messy input and still use my regex to filter out some lines?

1 answer:

Answer 0 (score: 4)

You are using raw string literals. Inside a raw (triple-quoted) string literal, \ does not introduce escape sequences such as tab \t or newline \n; the \n in a raw string literal is just 2 characters, a backslash followed by the letter n.
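A minimal sketch of the difference, assuming nothing beyond the standard library. The object name RawVsRegular is made up for illustration; the two strings are the tabstr values from the question.

```scala
object RawVsRegular {
  // Regular literal: the compiler turns \t into a single tab character.
  val regular: String = "123456\t123456"

  // Raw (triple-quoted) literal: \t stays as a backslash followed by 't'.
  val raw: String = """123456\t123456"""

  def main(args: Array[String]): Unit = {
    println(regular.length)       // 13: six digits, one tab, six digits
    println(raw.length)           // 14: six digits, '\', 't', six digits
    println(raw.contains("\\t")) // true: backslash + 't' is present
    println(raw.contains("\t"))   // false: there is no real tab in it
  }
}
```

The one-character length difference is the backslash that the raw literal keeps.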

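In a regex, to match a literal \, you need 2 backslashes when the pattern is written as a raw string literal, and 4 backslashes when it is written as a regular string literal (the compiler consumes one level of escaping before the regex engine sees the pattern).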
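The backslash counts can be checked directly. This is only an illustrative sketch; the object name BackslashCounts and the short input a\tb are assumptions, not part of the original question.

```scala
object BackslashCounts {
  // Input containing a real backslash followed by 't' (4 characters in total).
  val input: String = """a\tb"""

  // Raw-literal pattern: both backslashes reach the regex engine,
  // which reads them as one literal backslash.
  val rawRegex: String = """a\\tb"""

  // Regular-literal pattern: the compiler reduces four backslashes to two,
  // so the regex engine receives exactly the same pattern text.
  val regularRegex: String = "a\\\\tb"

  // Two backslashes in a regular literal leave the pattern a\tb,
  // which the regex engine reads as "a, TAB, b".
  val tabRegex: String = "a\\tb"

  def main(args: Array[String]): Unit = {
    println(rawRegex == regularRegex)    // true
    println(input.matches(rawRegex))     // true
    println(input.matches(regularRegex)) // true
    println(input.matches(tabRegex))     // false: input has no tab
  }
}
```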

So, to match all your inputs, you need to use the following regexps:

val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true

The non-capturing groups are not essential here: you only need to check whether a string matches, so capturing groups would work just as well. For the same reason you do not even need ^ and $, because matches requires the whole input string to match. If you later need to extract matches or captured groups, non-capturing groups give you a "cleaner" result structure, and that is all.
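Both points can be observed with a short sketch. The object name GroupsAndAnchors and the sample line (which uses real tab characters) are assumptions made for illustration; the calls are the standard String.matches and java.util.regex.Pattern API.

```scala
import java.util.regex.Pattern

object GroupsAndAnchors {
  // Hypothetical sample line with real tab characters.
  val line: String = "123456\tabc\tdef"

  // String.matches must consume the whole input, so ^ and $ change nothing.
  val anchored: Boolean   = line.matches("^[0-9]+(?:\t[^\t]+){2}$")
  val unanchored: Boolean = line.matches("[0-9]+(?:\t[^\t]+){2}")

  // A pattern that only covers a prefix of the input is rejected.
  val prefixOnly: Boolean = line.matches("[0-9]+")

  // The only difference between (...) and (?:...) is the number of groups
  // that the matcher exposes afterwards.
  val capturingCount: Int =
    Pattern.compile("[0-9]+(\t[^\t]+){2}").matcher(line).groupCount()
  val nonCapturingCount: Int =
    Pattern.compile("[0-9]+(?:\t[^\t]+){2}").matcher(line).groupCount()

  def main(args: Array[String]): Unit = {
    println(s"$anchored $unanchored $prefixOnly") // true true false
    println(s"$capturingCount $nonCapturingCount") // 1 0
  }
}
```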

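The last two regexps are simple: \\t matches a literal backslash followed by t, and (?:\\t|\t) matches either that two-character sequence or a real tab character, since \t on its own just matches a tab.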

The first one uses a tempered greedy token. It is the simplified form; a more efficient pattern can be written with the unroll-the-loop technique: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n).
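As a rough check that the two patterns accept the same lines, the sketch below runs both over a few sample strings. The samples and the object name TemperedVsUnrolled are hypothetical; the two patterns are copied from this answer. Agreement on these samples does not prove equivalence in general: for one thing, . does not match a real line terminator by default, while [^\\] does.

```scala
object TemperedVsUnrolled {
  // Tempered greedy token version.
  val tempered: String =
    """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""

  // Unrolled-loop version.
  val unrolled: String =
    """[0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)"""

  // Hypothetical inputs in which \t, \r and \n are literal two-character sequences.
  val samples: List[String] = List(
    """123456\thttp://some.url/path\t\x03U\x1D\n""", // valid, Unix line ending
    """123456\tfoo\tbar\r\n""",                       // valid, DOS line ending
    """123456\tfoo\n""",                              // only two columns
    """abc\tfoo\tbar\n""",                            // first column is not a number
    """123456\tfoo\tbar\tbaz\n"""                     // four columns
  )

  def results(regex: String): List[Boolean] = samples.map(_.matches(regex))

  def main(args: Array[String]): Unit = {
    println(results(tempered)) // List(true, true, false, false, false)
    println(results(unrolled)) // List(true, true, false, false, false)
  }
}
```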

  • [0-9]+ - 1 or more digits
  • (?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token: 2 occurrences of the literal two-character string \t, each followed by zero or more characters (any character except a real newline) that do not start one of the two-character combinations \t, \r or \n.
  • (?:\\r)? - 1 or 0 occurrences of the literal combination of \ and r
  • (?:\\n) - one occurrence of a literal combination of \ and n.
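To tie this back to the filtering task in the question, here is a minimal sketch that keeps only the lines matching the pattern explained above. The object FilterLines, the helper keepValid and the sample lines are made up for illustration, and the sketch assumes that each line really contains the escapes as literal backslash sequences, as mystr does.

```scala
import java.util.regex.Pattern

object FilterLines {
  // Compiled once and reused for every line.
  private val ValidLine: Pattern =
    Pattern.compile("""[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)""")

  /** Keeps only the lines that have a number followed by two more columns. */
  def keepValid(lines: Seq[String]): Seq[String] =
    lines.filter(line => ValidLine.matcher(line).matches())

  // Hypothetical sample input.
  val sample: Seq[String] = Seq(
    """1\ta\tb\n""",             // kept
    """x\ta\tb\n""",             // dropped: first column is not a number
    """2\tonly-two-columns\n""", // dropped: one column missing
    """3\tc\td\r\n"""            // kept: DOS line ending
  )

  def main(args: Array[String]): Unit =
    keepValid(sample).foreach(println)
}
```

If the file turns out to contain real tab and newline characters instead, the same filter can be reused with a pattern that uses \t, \r and \n as regex escapes.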