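How do I correctly parse a text file with RegexParsers?

Time: 2016-08-18 15:30:35

Tags: regex scala parsing

I want to parse the test data below. It works for 3 of the cases, so I assume there is a problem with my regex. When a line starts with # and its comment also starts with #, parsing stops working. Can somebody explain why? Here is my solution so far: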

val testDate =
  """
    |127.0.0.1 ads234.com
    |#127.0.0.1 auto.search.msn.com  # Microsoft uses this server to redirect
    |#127.0.0.1 sitefinder.verisign.com # Verisign has joined the game
    |#127.0.0.1 sitefinder-idn.verisign.com # of trying to hijack mistyped
    |#127.0.0.1 s0.2mdn.net     # This may interfere with some streaming
    |#127.0.0.1 ad.doubleclick.net   # This may interfere with www.sears.com
    |127.0.0.1 media.fastclick.net  # Likewise, this may interfere with some
    |127.0.0.1 cdn.fastclick.net
  """.stripMargin

I want to keep the # and the comment, if there are any.

object Example extends RegexParsers {
  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r
  def port: Parser[String] = """:\d{3}""".r

  def urlPort = url | url <~ port

  def pos1 = localhost ~ urlPort ^^ {
    case _ ~ dns => LineParsed("", dns, "")
  }
  def pos2 = pound ~ localhost ~ urlPort ^^ {
    case p ~ _ ~ dns => LineParsed(p, dns, "")
  }
  def pos3 = localhost ~ urlPort ~ comment ^^ {
    case _ ~ dns ~ com => LineParsed("", dns, com)
  }
  def pos4 = pound ~ localhost ~ urlPort ~ comment ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p, dns, com)
  }

  def linePos = pos1 | pos2 | pos3 | pos4

  def fullLine = repsep(linePos, """\W*""".r)
}

I get the following exception:

#127.0.0.1 auto.search.msn.com  # Microsoft uses this server to redirect

                                  ^
    java.lang.RuntimeException: No result when parsing failed

1 Answer:

Answer 0: (score: 1)

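There are a few mistakes in your code. First, newlines are treated as whitespace by default, but you need to "see" them in order to separate the entries correctly. So you need to redefine whitespace: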

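object Example extends RegexParsers {
   // needs: import scala.util.matching.Regex
   // skip only spaces and tabs, so that newlines stay visible to the parsers
   override protected val whiteSpace: Regex = "[ \t]+".r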

Then write the fullLine parser as:

   //allow several empty lines at the beginning and between entries
   def fullLine = rep("\n") ~> repsep(linePos, rep1("\n")) 

(Another option is to split the input into lines up front and parse each line individually.)
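A minimal sketch of that alternative, assuming the scala-parser-combinators module is on the classpath. The names `LineByLine`, `parseLines`, `ip` and `host`, the simplified regexes, and the shape of `LineParsed` are all assumptions made for illustration; the question does not show how `LineParsed` is defined. Because each line is parsed on its own, the default whitespace handling can stay as it is.

```scala
import scala.util.parsing.combinator.RegexParsers

// Assumed shape of LineParsed; the original post does not show its definition.
case class LineParsed(pound: String, dns: String, comment: String)

object LineByLine extends RegexParsers {
  // Simplified, hypothetical patterns; not the regexes from the question.
  def pound: Parser[String]   = "#"
  def ip: Parser[String]      = """(\d{1,3}\.){3}\d{1,3}""".r
  def host: Parser[String]    = """[A-Za-z0-9._-]+""".r
  def comment: Parser[String] = """#.*""".r

  // One hosts-file entry: optional leading '#', address, host name, optional comment.
  def line: Parser[LineParsed] = pound.? ~ ip ~ host ~ comment.? ^^ {
    case p ~ _ ~ h ~ c => LineParsed(p.getOrElse(""), h, c.getOrElse(""))
  }

  // Split first, then run the single-line parser on every non-empty line.
  def parseLines(text: String): List[LineParsed] =
    text.split("\n").toList
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap { l =>
        parseAll(line, l) match {
          case Success(result, _) => Some(result)
          case _                  => None // lines that do not match are skipped
        }
      }
}
```

Silently skipping unparsable lines is a design choice of this sketch; collecting the failures instead may be preferable when the input is not trusted.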

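The next mistake is the way you combine parsers with |. To parse A optionally followed by B, do not write A | A ~ B. It will never try to read B after reading A, because the left-hand alternative has already succeeded. Write A ~ B.? instead: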

  def urlPort = url <~ port.?  // But anyway, you'll never have a port in a hosts file!
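The difference can be checked in isolation. The sketch below assumes the scala-parser-combinators module is available; `ChoiceDemo`, `host`, `wrong` and `right` are hypothetical names, and the regexes are deliberately simplified.

```scala
import scala.util.parsing.combinator.RegexParsers

object ChoiceDemo extends RegexParsers {
  // Hypothetical, simplified parsers used only for this demonstration.
  def host: Parser[String] = """[a-z.]+""".r
  def port: Parser[String] = """:\d+""".r

  // Ordered choice: `host` alone succeeds first, so `host <~ port` is never tried.
  def wrong: Parser[String] = host | host <~ port

  // Optional suffix: `port.?` consumes the port when one is present.
  def right: Parser[String] = host <~ port.?

  def main(args: Array[String]): Unit = {
    println(parseAll(wrong, "example.com:8080").successful) // false, ":8080" is left over
    println(parseAll(right, "example.com:8080").successful) // true
    println(parseAll(right, "example.com").successful)      // true
  }
}
```

With `wrong`, the first alternative matches `example.com` and the choice is settled, so `parseAll` fails on the unconsumed `:8080`.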

Likewise, the four cases pos1 | pos2 | pos3 | pos4 can be simplified a lot:

  def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ {
     case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

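You can see here how the ? combinator yields an Option for p and com. I used getOrElse to fit the structure of LineParsed and to keep the original behaviour of your code, but a more Scala-ish approach would be to keep them as Options in the LineParsed class.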
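A sketch of what that Option-based variant might look like, assuming the scala-parser-combinators module is available. `HostEntry` is a hypothetical stand-in for `LineParsed`, and `OptionDemo`, `ip`, `host` and `entry` are illustrative names with simplified regexes.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical variant of LineParsed: the optional parts stay Option values.
case class HostEntry(pound: Option[String], dns: String, comment: Option[String])

object OptionDemo extends RegexParsers {
  // Simplified, hypothetical patterns; not the regexes from the question.
  def pound: Parser[String]   = "#"
  def ip: Parser[String]      = """(\d{1,3}\.){3}\d{1,3}""".r
  def host: Parser[String]    = """[A-Za-z0-9._-]+""".r
  def comment: Parser[String] = """#.*""".r

  // The results of `.?` are passed through unchanged, without getOrElse.
  def entry: Parser[HostEntry] = pound.? ~ ip ~ host ~ comment.? ^^ {
    case p ~ _ ~ h ~ c => HostEntry(p, h, c)
  }

  def main(args: Array[String]): Unit = {
    println(parseAll(entry, "127.0.0.1 cdn.fastclick.net"))
    println(parseAll(entry, "#127.0.0.1 s0.2mdn.net # streaming"))
  }
}
```

Callers can then distinguish "no comment" from "empty comment" by matching on `None` and `Some`, which the empty-string encoding cannot express.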

Here is the final working code that parses your example:

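import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

// Assumed definition: the question only shows how LineParsed is constructed.
case class LineParsed(pound: String, dns: String, comment: String)

object Example extends RegexParsers {
  // Skip only spaces and tabs, so that newlines stay visible to the parsers.
  override protected val whiteSpace: Regex = "[ \t]+".r
  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r
  def port: Parser[String] = """:\d{3}""".r
  def urlPort = url <~ port.?

  // Optional leading '#', address, host name, optional trailing comment.
  def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

  // Allow several empty lines at the beginning and between entries.
  def fullLine = rep("\n") ~> repsep(linePos, rep1("\n"))
}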