I have two files, shown below.
Keywords file
spark
scala
hive
Contents file
this is spark.
this can be scala and spark.
this is hive.
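My goal is to search each line of the contents file for the keywords. When searching, I should get only the last occurrence of a keyword (that is, even if a line contains two keywords, I should get only the one that occurs last), and then create a CSV file so the data can be loaded into a Hive table.
Expected output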
"this is spark.","spark"
"this can be scala and spark.","spark"
"this is hive.","hive"
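My contents file has millions of lines. What is the best and most optimized way to get this output?
Answer 0 (score: 0)
The question is fairly abstract. Assuming the contents are loaded into an RDD and the keywords into a list, the following code works.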
scala> val contents = sc.parallelize(Seq("this is spark.","this can be scala and spark.","this is hive."))
contents: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[137] at parallelize at <console>:24
scala> val keywordsRdd = sc.parallelize(Seq("spark", "scala", "hive"))
keywordsRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[138] at parallelize at <console>:24
scala> val keywords:List[String] = keywordsRdd.collect.toList
keywords: List[String] = List(spark, scala, hive)
scala> contents.flatMap(x=>x.split(",")).map(x => (x,x.split("\\s+").last.replaceAll("[.]",""))).filter(x=> keywords.contains(x._2)).collect.foreach(println)
(this is spark.,spark)
(this can be scala and spark.,spark)
(this is hive.,hive)
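Note that the snippet above compares only the final word of each line against the keyword list, so a line such as "spark is used with scala here" would be dropped. A minimal sketch of a more general approach follows, assuming plain case-sensitive substring matching (the question does not say whether whole-word matching is needed). The names `LastKeyword`, `lastKeyword` and `toCsvRow` are hypothetical helpers introduced for illustration only. It is written in plain Scala so it can be tried without a cluster.

```scala
// Sketch: pick the keyword whose last occurrence sits furthest right in a line,
// then format the pair as a quoted CSV row. Helper names are hypothetical.
object LastKeyword {

  // Returns the keyword that occurs last in `line`, or None if no keyword occurs.
  // Assumption: plain, case-sensitive substring search.
  def lastKeyword(line: String, keywords: Seq[String]): Option[String] = {
    val hits = keywords
      .map(k => (k, line.lastIndexOf(k)))   // position of each keyword's last occurrence
      .filter { case (_, idx) => idx >= 0 } // keep only keywords that actually occur
    if (hits.isEmpty) None else Some(hits.maxBy(_._2)._1)
  }

  // Formats one CSV row with both fields double-quoted; embedded quotes are doubled.
  def toCsvRow(line: String, keyword: String): String = {
    def quote(s: String): String = "\"" + s.replace("\"", "\"\"") + "\""
    quote(line) + "," + quote(keyword)
  }

  def main(args: Array[String]): Unit = {
    val keywords = Seq("spark", "scala", "hive")
    val contents = Seq("this is spark.", "this can be scala and spark.", "this is hive.")
    // Lines without any keyword are skipped.
    for {
      line <- contents
      kw   <- lastKeyword(line, keywords)
    } println(toCsvRow(line, kw))
  }
}
```

For the three sample lines this prints the rows listed under "Expected output" in the question. In Spark, the same functions could probably be applied with something like `contents.flatMap(line => lastKeyword(line, kws.value).map(kw => toCsvRow(line, kw)))`, where `kws` is the keyword list wrapped with `sc.broadcast`, followed by `saveAsTextFile`. With millions of lines and a short keyword list this is a single pass over the data, though that has not been benchmarked here.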