How do I apply a RegExp to a file in Spark?

Date: 2014-08-20 08:51:58

Tags: regex scala mapping apache-spark rdd


I have a file UDP_file.txt containing:

2014-03-02 07:59:37;source-address=123.235.78.125 source-port=1780
2014-03-02 07:59:37;source-address=123.235.132.181 source-port=56399
2014-03-02 07:59:37;source-address=123.234.141.253 source-port=49170
2014-03-02 07:59:37;source-address=123.234.104.225 source-port=39123
2014-03-02 07:59:37;source-address=123.234.104.225 fake-port=0000

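What I need to do is:

  • load the file,
  • apply a RegExp to it,
  • save the lines that match the pattern to the file 'good_records.txt',
  • save the lines that do not match the pattern to the file 'bad_records.txt'.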

val file_in = sc.textFile("UDP_file.txt")
val FullName = """(^.{19}).+source-address=([^"]+) source-port=([^"]+)""".r

When I test the pattern on a single line, it works:

scala> val FullName(ip,sa,sp) = "2014-03-02 07:59:37;source-address=10.114.104.225 source-port=39123"
ip: String = 2014-03-02 07:59:37
sa: String = 10.114.104.225
sp: String = 39123

scala> "2014-03-02 07:59:37;source-address=10.115.78.125 source-port=1780" match { case FullName(ip,sa,sp) => (ip,sa,sp) }
(2014-03-02 07:59:37,10.115.78.125,1780)

But I don't know how to use it on every line of the loaded file.

file_in.AndWhatNow?
Can you help? I would appreciate any suggestions. Pawel

2 answers:

Answer 0 (score: 4)

sc.textFile already gives you the input as individual lines, so you can map over them:

val FullName = """(.+);source-address=(.+) (?:fake|source)-port=(.+)""".r

val names = file_in map { line =>
    val FullName(ip, sa, sp) = line
    (ip, sa, sp)
}
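
To cover the good/bad split asked for in the question, a minimal sketch follows. It runs on a local Seq so it can be tried without a cluster. The sample lines and the pattern come from the question; the isGood helper and the output paths in the comments are assumptions made for illustration only.

```scala
import scala.util.matching.Regex

object SplitRecords {
  // Pattern from the question: 19-character timestamp, then address and port.
  val FullName: Regex =
    """(^.{19}).+source-address=([^"]+) source-port=([^"]+)""".r

  // Hypothetical helper: true when the whole line matches the pattern.
  def isGood(line: String): Boolean =
    FullName.pattern.matcher(line).matches()

  // Two sample lines taken from the question's file.
  val lines: Seq[String] = Seq(
    "2014-03-02 07:59:37;source-address=123.235.78.125 source-port=1780",
    "2014-03-02 07:59:37;source-address=123.234.104.225 fake-port=0000"
  )

  // On a local collection, partition splits the lines in one pass.
  val (good, bad) = lines.partition(isGood)

  // With an RDD the same predicate could be used with filter
  // (assumed usage, not run here; the paths are placeholders):
  //   file_in.filter(isGood).saveAsTextFile("good_records")
  //   file_in.filter(line => !isGood(line)).saveAsTextFile("bad_records")
}
```

Note that saveAsTextFile writes a directory of part files rather than a single .txt file, so the two outputs would be directories named after the given paths.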

Update

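To split the results by port type, capture the type in a group and then split on that group. A Scala collection has a partition method for this; an RDD does not, so parse once and filter twice:

val FullName = """(.+);source-address=(.+) (fake|source)-port=(.+)""".r

// Parse every line into (timestamp, address, port type, port).
val parsed = file_in map { line =>
  val FullName(ip, sa, pt, sp) = line
  (ip, sa, pt, sp)
}

// The third tuple field holds the port type captured by the pattern.
val goodOnes = parsed filter { _._3 == "source" }
val fakes = parsed filter { _._3 != "source" }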

Answer 1 (score: 0)

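With the previous solution, we get an error (a scala.MatchError) when a line does not match the pattern.
If we want to return different values for lines that match the pattern, for lines that match a different pattern, and for lines that match neither, use this code: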

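// Second_FullName is assumed to be a second Regex with two capture groups.
val names = file_in map { line =>
  line match {
    case FullName(ip, sa, sp) => (ip, sa, sp)
    case Second_FullName(val1, val2) => (val1, val2)
    case _ => Nil // fallback for lines that match neither pattern
  }
}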
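
The branches above return values of different shapes (a 3-tuple, a 2-tuple, and Nil), so the element type of names is only a broad common supertype. A possible alternative, sketched here under assumptions, is to return an Either so that matched and unmatched lines stay typed. The Record case class and the parse function are hypothetical names introduced for this example, and the pattern is a simplified variant of the one in the first answer.

```scala
import scala.util.matching.Regex

object ParseRecords {
  // Simplified variant of the first answer's pattern (source ports only).
  val FullName: Regex =
    """(.+);source-address=(.+) source-port=(.+)""".r

  // Hypothetical record type for a successfully parsed line.
  final case class Record(time: String, address: String, port: String)

  // Right holds the parsed record; Left keeps the raw line that did not match.
  def parse(line: String): Either[String, Record] = line match {
    case FullName(time, address, port) => Right(Record(time, address, port))
    case _                             => Left(line)
  }
}
```

With an RDD this could be applied as file_in.map(ParseRecords.parse), followed by filter(_.isRight) and filter(_.isLeft) to separate the good records from the bad ones.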