Extracting rows based on a specific value in Spark

Date: 2016-08-03 13:55:30

Tags: scala hadoop apache-spark

I am using Scala to find a specific word in a line and extract that line. The sample data I am working with is:

MSH|^~\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2
PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371

Here is my code:

object WordCount {
    def main(args: Array[String])
    {
        val textfile = sc.textFile("/user/cloudera/xxx/xxx")
        val word = textfile.filter(x => x.length >  0).map(_.split('|'))
        val keys = word.map(tuple => (tuple(0),tuple(5),(tuple(6)) ))
        val data =keys.map(x => x._1 + "," + x._2+ "," + x._3)
        val srch = data.filter(_.contains("PID")).map(tuple => (tuple(0),tuple(1),(tuple(2)) ))
        val show = srch.map(x => x._1 + "," + x._2+ "," + x._3)
        data.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
    }
}

The result I get:

MSH,BIN,20121009151949
PID,TEST^PATIENT,Rooney

Expected result:

PID,TEST^PATIENT,Rooney

What am I missing? Please help.

3 answers:

Answer 0 (score: 1)

Shouldn't it be:

show.saveAsTextFile("/user/cloudera/xxxx/Sparktest")

Answer 1 (score: 0)

I haven't tried this code, but it should look like this:

val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val word = textfile.filter(x => x.length >  0).map(_.split('|'))
val keys = word.map(tuple => (tuple(0),tuple(5),(tuple(6)) ))
val data = keys.map(x => x._1 + "," + x._2+ "," + x._3)
val srch = data.filter(_.contains("PID"))
srch.saveAsTextFile("/user/cloudera/xxxx/Sparktest")

It can be transformed into something like:

val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val result = textfile
    .filter(x => x.length >  0)
    .map(_.split('|'))
    .map(array => (array(0),array(5),array(6)))
    .map(x => x._1 + "," + x._2+ "," + x._3)
    .filter(_.contains("PID"))
result.saveAsTextFile("/user/cloudera/xxxx/Sparktest")

This way you don't need to give a name to every intermediate step (unless you really need to refer to a particular one), and it prevents the mistake you had, which was saving the val data instead of show.
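A side note, not in the original answers: filtering with `contains("PID")` would also match any line where "PID" happens to appear in another field. A Spark-free sketch of the same transformation on a plain collection, matching the segment identifier exactly, might look like this (the field indices 0, 5, 6 are assumed from the sample data above):

```scala
// Spark-free sketch of the same pipeline on a plain collection, to show
// why an exact match on the first field is safer than contains("PID").
// Field indices (0, 5, 6) are assumed from the sample data in the question.
val lines = Seq(
  "MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2",
  "PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371"
)
val result = lines
  .filter(_.nonEmpty)
  .map(_.split('|'))
  .filter(fields => fields.length > 6 && fields(0) == "PID") // exact segment match
  .map(fields => fields(0) + "," + fields(5) + "," + fields(6))
```

In Spark the same chain applies unchanged to the RDD; only the `Seq` becomes `sc.textFile(...)`.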

Answer 2 (score: 0)

You are saving the wrong data. You should save show instead of data. I think you can refactor the code further:

case class Header(h1: String, h2: String, h3: String) {
  override def toString = h1 + "," + h2 + "," + h3
}

The code would look like:

object WordCount {
  def main(args: Array[String]) {
    val textfile = sc.textFile("/user/cloudera/xxx/xxx")
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    val keys = word.map(tuple => Header(tuple(0), tuple(5), tuple(6)))
    val search = keys.filter(_.h1.equals("PID")).map(_.toString)
    search.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
  }
}

So the case class can be reused in the future.
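For instance, a hypothetical companion object (not part of the original answer) could keep the parsing logic next to the case class, so other jobs can reuse both together:

```scala
// Sketch: a hypothetical companion that builds a Header from a split segment,
// keeping the field indices (0, 5, 6) in one place.
// Indices are assumed from the sample data in the question.
case class Header(h1: String, h2: String, h3: String) {
  override def toString = h1 + "," + h2 + "," + h3
}

object Header {
  def fromFields(fields: Array[String]): Header =
    Header(fields(0), fields(5), fields(6))
}

val pid = Header.fromFields("PID|1|xxxxx|xxxx||TEST|Rooney|19761202".split('|'))
```

Any job that needs the same three fields can then call `Header.fromFields` on a split line instead of repeating the indices.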