I have a streaming job that does the following:
1. Load the data from an HDFS file and register it as a temp table.
2. Then join the temp table with tables present in the Hive database.
3. Then send the record to Kafka.
Initially a cycle takes 12 seconds to complete, but after 10 hours it grows to 50 seconds. I don't understand this issue. I also noticed that after 10 hours the shuffle write on each node keeps growing as well, and it is 200 GB+.
The sample code is:
// Load the CSV data from HDFS and transform each row.
val rowRDD = hContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", delimiter)
  .load(path)
  .map(col => dosomething)
// Filter the RDD to keep only the records that fall within the time range.
val filteredRDD = rowRDD.filter { col => dosomething }
// Create a new DataFrame from the filtered RDD and the schema.
val tblDF = hContext.createDataFrame(filteredRDD, tblSchema)
  .where("crud_status IN ('U','D','I')")
// Register the DataFrame as a temporary table.
tblDF.registerTempTable("name_changed")
// Join the temp table with the Hive tables.
val userDF = hContext.sql(
  """SELECT id, name, account
     FROM name_changed
     JOIN account ON (name_changed.id = account.id)
     JOIN question ON (account.question = question.question)""")
// Send each record to Kafka; one producer is created and closed per partition.
userDF.foreachPartition { records =>
  val producer = getKafkaProducer(kafkaBootstrap)
  records.foreach { rowData =>
    // Serialize the Row to bytes before sending (delimited text here, just as an illustration).
    producer.send(new ProducerRecord[String, Array[Byte]](topicName, rowData.mkString(delimiter).getBytes("UTF-8")))
  }
  producer.close()
}
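For reference, `getKafkaProducer` is not shown above; below is a minimal sketch of what such a helper might look like, assuming String keys and byte-array values to match the ProducerRecord used in the loop (the property values are assumptions, not the actual configuration):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

// Hypothetical helper: builds a producer whose key/value types match
// ProducerRecord[String, Array[Byte]] used in foreachPartition above.
def getKafkaProducer(kafkaBootstrap: String): KafkaProducer[String, Array[Byte]] = {
  val props = new Properties()
  props.put("bootstrap.servers", kafkaBootstrap)
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[ByteArraySerializer].getName)
  new KafkaProducer[String, Array[Byte]](props)
}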