Filtering stream data with Spark Streaming

Date: 2016-08-07 06:11:11

Tags: apache-spark spark-streaming spark-cassandra-connector

I am trying to filter streaming data and, based on the value of the id column, save the records to different tables.

I have two tables:

  1. testTable_odd (id, data1, data2)
  2. testTable_even (id, data1)

If the id value is odd, I want to save the record to testTable_odd; if it is even, to testTable_even.

The tricky part here is that my two tables have different columns. I have tried several approaches, including giving the Scala function a return type of [obj1, obj2], but I could not get it to work. Any pointers would be greatly appreciated.

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.SomeColumns

    import kafka.serializer.StringDecoder

    object StreamProcessor extends Serializable {

      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
          .set("spark.cassandra.connection.host", "127.0.0.1")

        val sc = new SparkContext(sparkConf)
        val ssc = new StreamingContext(sc, Seconds(2))
        val sqlContext = new SQLContext(sc)

        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val topics = args.toSet

        // Direct Kafka stream of (key, message) pairs
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // Current approach: parse each message and save everything to one table.
        // This is what I want to split across testTable_odd and testTable_even.
        stream
          .map { case (_, msg) =>
            val result = msgParseMaster(msg)
            (result.id, result.data1)
          }
          .foreachRDD { rdd =>
            if (!rdd.isEmpty)
              rdd.saveToCassandra("testKS", "testTable", SomeColumns("id", "data1"))
          }

        ssc.start()
        ssc.awaitTermination()
      }

      import org.json4s._
      import org.json4s.native.JsonMethods._

      case class wordCount(id: Long, data1: String, data2: String) extends Serializable

      implicit val formats = DefaultFormats

      // Parse one JSON message into the wordCount case class
      def msgParseMaster(msg: String): wordCount = {
        parse(msg).extract[wordCount]
      }
    }
    

2 Answers:

Answer 0 (score: 1)

I think you just want to use the filter function twice. You could do something like this:

    val evenstream = stream.map { case (_, msg) =>
      val result = msgParseMaster(msg)
      (result.id, result.data1)
    }.filter { k =>
      k._1 % 2 == 0
    }

    evenstream.foreachRDD { rdd =>
      // Do something with the even stream
    }

    val oddstream = stream.map { case (_, msg) =>
      val result = msgParseMaster(msg)
      (result.id, result.data1)
    }.filter { k =>
      k._1 % 2 == 1
    }

    oddstream.foreachRDD { rdd =>
      // Do something with the odd stream
    }

When I did something similar in a project here, I used the filter function twice; if you look down around line 191 you can see where I classify and save tuples based on their values between 0 and 1, so feel free to check that out.
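As a minimal sketch of how that two-filter pattern maps onto the question's two tables (assuming the wordCount case class, the stream variable, and the testKS keyspace from the question): parse each message once, split by id parity, then project each branch down to the column set of its target table before saving.

    // Parse each Kafka message once into the question's case class
    val parsed = stream.map { case (_, msg) => msgParseMaster(msg) }

    // Odd ids -> testTable_odd (id, data1, data2)
    parsed.filter(_.id % 2 == 1)
      .map(r => (r.id, r.data1, r.data2))
      .foreachRDD { rdd =>
        if (!rdd.isEmpty)
          rdd.saveToCassandra("testKS", "testTable_odd", SomeColumns("id", "data1", "data2"))
      }

    // Even ids -> testTable_even (id, data1)
    parsed.filter(_.id % 2 == 0)
      .map(r => (r.id, r.data1))
      .foreachRDD { rdd =>
        if (!rdd.isEmpty)
          rdd.saveToCassandra("testKS", "testTable_even", SomeColumns("id", "data1"))
      }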

Answer 1 (score: 1)

I performed the following steps:

  1. Extracted the details from the raw JSON string into a case class.
  2. Created a super JSON (containing the details needed by both filter criteria).
  3. Converted that JSON to a DataFrame.
  4. Applied select and where clauses to it.
  5. Saved the result to Cassandra.
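Below is a minimal sketch of steps 3 through 5, not the answerer's exact code, assuming the wordCount case class and stream from the question, Spark 1.x's SQLContext, and the Cassandra connector's DataFrame source (keyspace and table names are taken from the question):

    stream.map { case (_, msg) => msgParseMaster(msg) }.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        import sqlContext.implicits._
        // Step 3: convert the parsed records to a DataFrame
        val df = rdd.toDF()

        // Steps 4-5: select/where per target table, then save to Cassandra
        df.filter("id % 2 = 1").select("id", "data1", "data2")
          .write.format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "testKS", "table" -> "testTable_odd"))
          .mode(SaveMode.Append)
          .save()

        df.filter("id % 2 = 0").select("id", "data1")
          .write.format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "testKS", "table" -> "testTable_even"))
          .mode(SaveMode.Append)
          .save()
      }
    }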