I am using Scala to find a specific word in a line and extract that line. The sample data I am working with looks like this:
MSH|^~\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2
PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371
Here is my code:
object WordCount {
def main(args: Array[String])
{
val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(tuple => (tuple(0),tuple(5),(tuple(6)) ))
val data =keys.map(x => x._1 + "," + x._2+ "," + x._3)
val srch = data.filter(_.contains("PID")).map(tuple => (tuple(0),tuple(1),(tuple(2)) ))
val show = srch.map(x => x._1 + "," + x._2+ "," + x._3)
data.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
}
}
The result I get:
MSH,BIN,20121009151949
PID,TEST^PATIENT,Rooney
Expected result:
PID,TEST^PATIENT,Rooney
What am I missing? Please help.
Answer 0 (score: 1)
Shouldn't it be:
show.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
Answer 1 (score: 0)
I haven't tried this code, but it should look something like:
val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(tuple => (tuple(0),tuple(5),(tuple(6)) ))
val data = keys.map(x => x._1 + "," + x._2+ "," + x._3)
val srch = data.filter(_.contains("PID"))
srch.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
It can be turned into something like this:
val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val result = textfile
.filter(x => x.length > 0)
.map(_.split('|'))
.map(array => (array(0),array(5),array(6)))
.map(x => x._1 + "," + x._2+ "," + x._3)
.filter(_.contains("PID"))
result.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
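This way you don't need to name every intermediate step (unless you actually need one of them for something specific), and it prevents the bug you had, which was saving the val data instead of show.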
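To check the chained steps without a cluster, the same pipeline can be sketched on a plain Scala Seq. This is only an illustration: the object name ChainedPipelineSketch is made up, the two input lines are the samples from the question, and a local collection stands in for the RDD.

```scala
object ChainedPipelineSketch {
  // Sample lines copied from the question (assumption: they are representative).
  val lines: Seq[String] = Seq(
    "MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2",
    "PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371"
  )

  // Same steps as the chained RDD version, applied to a local Seq instead.
  val result: Seq[String] = lines
    .filter(_.nonEmpty)                      // drop empty lines
    .map(_.split('|'))                       // split each line into fields
    .map(a => (a(0), a(5), a(6)))            // keep fields 0, 5 and 6
    .map { case (a, b, c) => s"$a,$b,$c" }   // join them with commas
    .filter(_.contains("PID"))               // keep only the PID rows

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Because only the final value has a name, there is no intermediate val that could be saved by mistake.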
Answer 2 (score: 0)
You are saving the wrong data: you should save show instead of data. I think the code can also be refactored further.
case class Header(h1:String,h2:String,h3:String){
override def toString = h1+","+h2+","+h3
}
The code would then look like this:
object WordCount {
def main(args: Array[String])
{
val textfile = sc.textFile("/user/cloudera/xxx/xxx")
val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(tuple => Header(tuple(0),tuple(5),tuple(6)))
val search = keys.filter(_.h1.equals("PID")).map(_.toString)
search.saveAsTextFile("/user/cloudera/xxxx/Sparktest")
}
}
That way the case class can be reused in the future.
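One thing the case-class version might still trip over is a line with fewer than seven fields, where tuple(6) would throw. A minimal sketch of a guarded parser follows, run on a local Seq rather than an RDD. The names HeaderParser, parse and HeaderSketch are hypothetical, and the extra short and empty lines are invented test input.

```scala
case class Header(h1: String, h2: String, h3: String) {
  override def toString: String = s"$h1,$h2,$h3"
}

object HeaderParser {
  // Hypothetical helper: returns None when a line has fewer than 7 fields,
  // instead of throwing ArrayIndexOutOfBoundsException.
  def parse(line: String): Option[Header] = {
    val f = line.split('|')
    if (f.length > 6) Some(Header(f(0), f(5), f(6))) else None
  }
}

object HeaderSketch {
  // First two lines come from the question; the last two are invented edge cases.
  val lines: Seq[String] = Seq(
    "MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2",
    "PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371",
    "",
    "PID|1|too|short"
  )

  // Keep only well-formed rows whose first field is exactly "PID".
  val pidRows: Seq[String] =
    lines.flatMap(l => HeaderParser.parse(l)).filter(_.h1 == "PID").map(_.toString)

  def main(args: Array[String]): Unit = pidRows.foreach(println)
}
```

Comparing h1 with "PID" also avoids matching a row that merely contains the text PID in some other field, which contains("PID") on the joined string would do. The same flatMap, filter and map chain should carry over to an RDD, though that has not been tested here.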