Regex pattern matching with a variable number of capture groups

Date: 2013-03-26 15:10:16

Tags: regex scala pattern-matching

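A text file should be parsed line by line using Scala pattern matching and regular expressions. If a line starts with "names:\t", the tab-separated names that follow should be made available as a Seq[String] (or something similar).

Here is a non-working code example: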

val Names = "^names:(?:\t([a-zA-Z0-9_]+))+$".r

"names:\taaa\tbbb\tccc" match {
  case Names(names @ _*) => println(names)
  // […] other cases
  case _ => println("no match")
}

Output: List(ccc)
Desired output: List(aaa, bbb, ccc)
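
The reason only the last name shows up is that the pattern contains a single capturing group, and java.util.regex keeps only the text from the last iteration of a repeated group. A minimal sketch that inspects the underlying matcher (the object name RepeatedGroupDemo and the helper lastCapture are made up for illustration):

```scala
object RepeatedGroupDemo {
  // Same pattern as in the non-working example above.
  val Names = "^names:(?:\t([a-zA-Z0-9_]+))+$".r

  // Returns the number of capturing groups and the content of group 1,
  // or None if the line does not match at all.
  def lastCapture(line: String): Option[(Int, String)] = {
    val m = Names.pattern.matcher(line)
    if (m.matches()) Some((m.groupCount, m.group(1))) else None
  }

  def main(args: Array[String]): Unit =
    println(lastCapture("names:\taaa\tbbb\tccc")) // prints Some((1,ccc))
}
```

Because the group count is fixed by the pattern text, Names(names @ _*) can never bind more than one element here, no matter how many names the line contains.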

The following code works as desired...

object NamesObject {
  private val NamesLine = "^names:\t([a-zA-Z0-9_]+(?:\t[a-zA-Z0-9_]+)*)$".r

  def unapplySeq(s: String): Option[Seq[String]] = s match {
    case NamesLine(nameString) => Some(nameString.split("\t"))
    case _ => None
  }
}

"names:\taaa\tbbb\tccc" match {
  case NamesObject(names @ _*) => println(names)
  // […] other cases
  case _ => println("no match")
}

Output (as desired): WrappedArray(aaa, bbb, ccc)

I would like to know: can this be achieved in a simpler way, without creating an object, as in the first (non-working) code example?

2 Answers:

Answer 0: (score: 1)

Use your working regex. (\w is the predefined character class for [a-zA-Z0-9_].)

  val Names = """names:\t(\w+(?:\t\w+)*)""".r
  "names:\taaa\tbbb\tccc" match {
    case Names(names) => println(names.split("\t").toSeq)
    case _ => println("no match")
  }

To bind the first name, the second name, and the remaining tail separately:

  val Names = """names:\t(\w+)?\t?(\w+)?\t?((?:\w+?\t?)*)""".r
  "names:\taaa\tbbb\tccc\tddd" match {
    case Names(first, second, tail) =>
      println(first + ", " + second + ", " + tail.split("\t").toSeq)
    case _ => println("no match")
  }
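
One caveat with the pattern above: the first two groups are optional, so on shorter lines they do not participate in the match and Scala binds them to null, while the tail group becomes an empty string. The pattern even accepts a line that has no names after the prefix. A hedged sketch of one way to guard against that (OptionalGroupsDemo and parse are illustrative names, not part of the answer):

```scala
object OptionalGroupsDemo {
  // Same pattern as in the first/second/tail example above.
  val Names = """names:\t(\w+)?\t?(\w+)?\t?((?:\w+?\t?)*)""".r

  // Wraps the possibly-null groups in Option and turns an empty tail
  // into an empty sequence instead of Seq("").
  def parse(line: String): Option[(Option[String], Option[String], Seq[String])] =
    line match {
      case Names(first, second, tail) =>
        val rest =
          if (tail == null || tail.isEmpty) Seq.empty[String]
          else tail.split("\t").toSeq
        Some((Option(first), Option(second), rest))
      case _ => None
    }
}
```

With this wrapper, a line with a single name yields None for the second name rather than a null that fails later.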

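Answer 1: (score: 0)

As Randall Schulz said, this does not seem to be possible with a regular expression alone. So the short answer to my question is: no.

My current solution is as follows. I use this class...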

import scala.util.matching.Regex

class SeparatedLinePattern(Pattern: Regex, separator: String = "\t") {
  def unapplySeq(s: String): Option[Seq[String]] = s match {
    case Pattern(nameString) => Some(nameString.split(separator).toSeq)
    case _ => None
  }
}

...to create the patterns:

val Names = new SeparatedLinePattern("""names:\t([A-Za-z]+(?:\t[A-Za-z]+)*)""".r)
val Ints = new SeparatedLinePattern("""ints:\t(\d+(?:\t\d+)*)""".r)
val ValuesWithID = new SeparatedLinePattern("""id-value:\t(\d+\t\w+(?:\t\d+\t\w+)*)""".r)
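
For readers less familiar with extractors: a pattern such as Ints(a, b) amounts to calling unapplySeq on the scrutinee and then checking that the returned sequence has exactly two elements, while Ints(nums @ _*) accepts any length. A small self-contained sketch, repeating the class so it compiles on its own (the wrapper object ExtractorDemo and the helper describe are illustrative only):

```scala
import scala.util.matching.Regex

object ExtractorDemo {
  class SeparatedLinePattern(Pattern: Regex, separator: String = "\t") {
    def unapplySeq(s: String): Option[Seq[String]] = s match {
      case Pattern(body) => Some(body.split(separator).toSeq)
      case _ => None
    }
  }

  val Ints = new SeparatedLinePattern("""ints:\t(\d+(?:\t\d+)*)""".r)

  // The fixed-arity case must come before the vararg case,
  // because the vararg pattern matches any number of values.
  def describe(line: String): String = line match {
    case Ints(a, b) => s"two: $a, $b"
    case Ints(nums @ _*) => s"${nums.length} value(s)"
    case _ => "no match"
  }
}
```

Calling ExtractorDemo.Ints.unapplySeq("ints:\t1\t2") directly returns Some of a two-element sequence, which is what the compiler inspects when it picks a case.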

Here is an example of how they can be used in a very flexible way:

val testStrings = List("names:\taaa", "names:\tbbb\tccc", "names:\tddd\teee\tfff\tggg\thhh",
                       "ints:\t123", "ints:\t456\t789", "ints:\t100\t200\t300",
                       "id-value:\t42\tbaz", "id-value:\t2\tfoo\t5\tbar\t23\tbla")

for (s <- testStrings) s match {
  case Names(name) => println(s"The name is '$name'")
  case Names(a, b) => println(s"The two names are '$a' and '$b'")
  case Names(names @ _*) => println("Many names: " + names.mkString(", "))

  case Ints(a) => println(s"Just $a")
  case Ints(a, b) => println(s"$a + $b == ${a.toInt + b.toInt}")
  case Ints(nums @ _*) => println("Sum is " + (nums map (_.toInt)).sum)

  case ValuesWithID(id, value) => println(s"ID of '$value' is $id")
  case ValuesWithID(values @ _*) => println("As map: " + (values.grouped(2) map (x => x(0).toInt -> x(1))).toMap)

  case _ => println("No match")
}

Output:

The name is 'aaa'
The two names are 'bbb' and 'ccc'
Many names: ddd, eee, fff, ggg, hhh
Just 123
456 + 789 == 1245
Sum is 600
ID of 'baz' is 42
As map: Map(2 -> foo, 5 -> bar, 23 -> bla)