How to format an RDD with a regular expression and then store it in MongoDB with Spark

Posted: 2018-10-06 18:01:22

Tags: scala apache-spark spark-streaming

The raw data is just raw web logs, aggregated with Flume and published via Kafka, like this:

60.175.130.12 - - [21/Apr/2018:20:46:35 +0800] "GET /wp-admin/edit.php HTTP/1.1" 200 13347 "http://.....php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15"

I want to use Spark Streaming to receive batches of logs and then split each line with a regular expression like the one below:

val regex = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r

and map each line into a database-friendly form:

case class log(
            host: String,
            rfc931: String,
            username: String,
            data_time: String,
            req_method: String,
            req_url: String,
            req_protocol: String,
            statuscode: String,
            bytes: Int,
            referrer: String,
            user_agent: String)
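
As a quick sanity check (my own standalone sketch, not part of the original question), the pattern and its group numbering can be verified against the sample line above before wiring anything into the stream; the groups printed here correspond to fields of the case class:

object RegexCheck extends App {
  val regex = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r

  // Sample line copied from the question; stripMargin/trim only avoid putting a
  // closing double quote right before the triple-quote delimiter.
  val sample =
    """60.175.130.12 - - [21/Apr/2018:20:46:35 +0800] "GET /wp-admin/edit.php HTTP/1.1" 200 13347 "http://.....php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15"
      |""".stripMargin.trim

  regex.findFirstMatchIn(sample) match {
    case Some(m) => println(s"host=${m.group(1)} method=${m.group(5)} status=${m.group(8)} bytes=${m.group(9)}")
    case None    => println("no match")
  }
}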

Then simply append each batch to MongoDB.

But I ran into problems when splitting the batches:

val lines = stream.flatMap { batch =>
  batch.value().split("\n")
}
val records = lines.map { record =>
  val regex = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r
  val matched = regex.findAllIn(record)
  log(matched.group(1), matched.group(2), matched.group(3), matched.group(4), matched.group(5), matched.group(6), matched.group(7), matched.group(8), matched.group(9).toInt, matched.group(10), matched.group(11))
}
records.foreachRDD{ record =>
  import db.implicits._
  val record_DF = record.toDF()
  record_DF.write.mode("append").mongo()
}
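
For reference, a hedged sketch of how that write step typically looks when made self-contained; the SparkSession setup, the spark.mongodb.output.uri configuration, and the names used here are assumptions on my part, not taken from the question. It still assumes records is the DStream of log records built above and that the MongoDB Spark Connector is on the classpath (its com.mongodb.spark.sql._ package is what provides the .mongo() call used in the question):

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.sql._   // provides the .mongo() extension on DataFrameWriter

records.foreachRDD { rdd =>
  // Reuse (or lazily create) a SparkSession from the RDD's configuration; the
  // output database/collection are assumed to be set via spark.mongodb.output.uri.
  val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._       // needed for .toDF() on an RDD of case classes

  val recordDF = rdd.toDF()
  recordDF.write.mode("append").mongo()
}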

This is how I think it should be implemented: first split the stream into lines, then map each line with the regex into the log format, and finally write it to the database.

The program fails with a regex matching failure along the lines of "no match has been found" or something similar.

...

I'm just a beginner and need some help.

1 Answer:

Answer 0 (score: 0)

After this modification the problem was solved:

val records = lines.map { record =>
  val PATTERN = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r
  // findFirstMatchIn returns an Option[Match]; .get extracts the Match so the
  // capture groups can be read directly.
  val options = PATTERN.findFirstMatchIn(record)
  val matched = options.get
  log(matched.group(1), matched.group(2), matched.group(3), matched.group(4), matched.group(5),
    matched.group(6), matched.group(7), matched.group(8), matched.group(9).toInt, matched.group(10), matched.group(11))
}
records.print()
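
One caveat with this fix (my own note, not from the answer): options.get will still throw for any line the pattern does not match. A slightly more defensive variant, sketched below under the same assumptions, simply drops non-matching lines:

val records = lines.flatMap { record =>
  val PATTERN = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r
  // findFirstMatchIn returns Option[Match]; .toList turns a miss into an empty
  // collection, so flatMap skips lines that do not match the pattern.
  PATTERN.findFirstMatchIn(record).toList.map { m =>
    // Note: m.group(9).toInt still assumes the bytes field is numeric, not "-".
    log(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(6),
      m.group(7), m.group(8), m.group(9).toInt, m.group(10), m.group(11))
  }
}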