I have a streaming job that does the following:
1. Load the data from an HDFS file and register it as a temp table.
2. Then join the temp table with tables present in the Hive database.
3. Then send the record to Kafka.
Initially a cycle takes 12 seconds to complete, but after 10 hours it grows to 50 seconds. I don't understand this issue. I also noticed that after 10 hours the shuffle write on each node keeps growing as well, and it is 200 GB+.
The sample code is:
// Load the CSV data from HDFS and transform each row.
val rowRDD = hContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", delimiter)
  .load(path)
  .map(col => dosomething)
// Filter the RDD to keep only the records that fall within the time range.
val filteredRDD = rowRDD.filter { col => dosomething }
// Create a new DataFrame from the filtered RDD and the schema.
val tblDF = hContext.createDataFrame(filteredRDD, tblSchema)
  .where("crud_status IN ('U','D','I')")
// Register the DataFrame as a temporary table.
tblDF.registerTempTable("name_changed")
// Join the temp table with the Hive tables.
val userDF = hContext.sql(
  """SELECT id, name, account
     FROM name_changed
     JOIN account ON (name_changed.id = account.id)
     JOIN question ON (account.question = question.question)""")
// Send each record to Kafka; one producer is created and closed per partition.
userDF.foreachPartition { records =>
  val producer = getKafkaProducer(kafkaBootstrap)
  records.foreach { rowData =>
    // Serialize the Row to bytes before sending (delimited text here, just as an illustration).
    producer.send(new ProducerRecord[String, Array[Byte]](topicName, rowData.mkString(delimiter).getBytes("UTF-8")))
  }
  producer.close()
}
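For reference, `getKafkaProducer` is not shown above; below is a minimal sketch of what such a helper might look like, assuming String keys and byte-array values to match the ProducerRecord used in the loop (the property values are assumptions, not the actual configuration):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

// Hypothetical helper: builds a producer whose key/value types match
// ProducerRecord[String, Array[Byte]] used in foreachPartition above.
def getKafkaProducer(kafkaBootstrap: String): KafkaProducer[String, Array[Byte]] = {
  val props = new Properties()
  props.put("bootstrap.servers", kafkaBootstrap)
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[ByteArraySerializer].getName)
  new KafkaProducer[String, Array[Byte]](props)
}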