Based on position

Date: 2018-12-28 07:07:50

Tags: scala apache-spark lookup

I have 2 files, as shown below.

Keyword file

spark
scala
hive

Content file

this is spark.
this can be scala and spark.
this is hive.

My aim is to search for the keywords in each line of the content file. While searching, I should capture only the last occurrence of a keyword (i.e., even if a line contains 2 keywords, I should get only the last one) and create a CSV file so the data can be loaded into a Hive table.

Expected output

"this is spark.","spark"
"this can be scala and spark.","spark"
"this is hive.","hive"

My content file has millions of lines. What is the best and most optimized way to get this output?

1 Answer:

Answer 0 (score: 0)

The question is somewhat abstract. Assuming the contents are loaded into an RDD and the keywords into a list, the code below works.

scala> val contents = sc.parallelize(Seq("this is spark.","this can be scala and spark.","this is hive."))
contents: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[137] at parallelize at <console>:24

scala> val keywordsRdd = sc.parallelize(Seq("spark", "scala", "hive"))
keywordsRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[138] at parallelize at <console>:24

scala> val keywords:List[String] = keywordsRdd.collect.toList
keywords: List[String] = List(spark, scala, hive)

scala> contents.flatMap(x => x.split(",")).map(x => (x, x.split("\\s+").last.replaceAll("[.]", ""))).filter(x => keywords.contains(x._2)).collect.foreach(println)
(this is spark.,spark)
(this can be scala and spark.,spark)
(this is hive.,hive)
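
Note that the snippet above only matches a keyword when it happens to be the last word of a line. A sketch that finds the last keyword occurrence anywhere in a line, broadcasts the keyword list to the executors, and emits the quoted CSV format from the question could look like the following (the input/output paths and the punctuation-stripping regex are assumptions):

import org.apache.spark.sql.SparkSession

object LastKeywordMatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LastKeywordMatch").getOrCreate()
    val sc = spark.sparkContext

    val contents = sc.textFile("content.txt")                    // content file path is an assumption
    val keywords = sc.textFile("keywords.txt").collect.toList    // keyword file path is an assumption

    // Broadcast the small keyword list so every executor holds one read-only copy
    // instead of shipping it with each task closure
    val kw = sc.broadcast(keywords)

    val csvLines = contents.flatMap { line =>
      // Tokenize the line, strip punctuation, and scan from the right so that
      // only the LAST keyword occurrence in the line is kept; lines with no
      // keyword produce None and are dropped by flatMap
      val tokens = line.split("\\s+").map(_.replaceAll("[^a-zA-Z0-9]", ""))
      tokens.reverseIterator
        .find(kw.value.contains)
        .map(k => s""""$line","$k"""")
    }

    csvLines.saveAsTextFile("/tmp/keyword_matches")              // output path is an assumption
    spark.stop()
  }
}

Broadcasting matters once the content file runs to millions of lines, and the resulting files under the output directory are plain quoted CSV, ready to be loaded into a Hive table.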