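I am using Spark Streaming and Spark SQL with Kafka, and my code runs many SQL queries. Watching GC activity on the Spark driver, I see the old generation grow very quickly and full GCs happen very frequently, until what looks like a memory leak causes the driver to shut down. When I analyze the heap, there is a large number of org.apache.spark.sql.columnar.ColumnBuilder objects taking up 90% of the space; tracing them shows the memory is held by HeapByteBuffer. I don't know why these objects are never released and just sit there waiting to be collected.

In my cluster I give the driver 2g of memory. If I want the streaming program to run longer, all I can do is increase the driver memory, but I don't think that is a reasonable approach: the memory eventually fills up and cannot be freed, so the only option is to restart the driver. Can anyone tell me why? Is something wrong with my program?

My code is here: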

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object LogAnalyzerStreamingSQL {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
        val sc = new SparkContext(sparkConf)
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Batch interval must be a Duration, not a bare number.
        val ssc = new StreamingContext(sc, Seconds(30))

        val topicSet = Set("applogs")
        val kafkaParams = Map[String, String](
          "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
          "group.id" -> "app_group",
          "serializer.class" -> "kafka.serializer.StringEncoder")
        val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicSet)

        kafkaStream.foreachRDD(rdd => {
          if (!rdd.isEmpty()) {
            // Each Kafka record is a (key, value) pair; the value is the JSON log line.
            val jsonRdd = rdd.map(x => x._2)
            val df = sqlContext.read.json(jsonRdd)
            df.registerTempTable("applogs")
            sqlContext.cacheTable("applogs")

            // Calculate statistics based on the content size.
            val contentSizeStats = sqlContext
              .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
              .show()

            // Compute Response Code to Count.
            // DataFrame.map returns an RDD in Spark 1.x, so print via collect() instead of show().
            val responseCodeToCount = sqlContext
              .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
              .map(row => (row.getInt(0), row.getLong(1)))
              .collect().foreach(println)

            // Any IPAddress that has accessed the server more than 10 times.
            val ipAddresses = sqlContext
              .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
              .map(row => row.getString(0))
              .take(100)

            // Top 10 endpoints by request count.
            val topEndpoints = sqlContext
              .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")
              .map(row => (row.getString(0), row.getLong(1)))
              .collect().foreach(println)

            //....a lot of sql like that

            sqlContext.uncacheTable("applogs")
          }
        })

        ssc.start()
        ssc.awaitTermination()
      }
    }

Heap memory analysis images: [enter image description here][2] [enter image description here][3]

jstat info:

    $ ./jstat -gcutil 48004 5000
      S0    S1       E       O      P    YGC     YGCT   FGC      FGCT       GCT
    0.00  0.00   98.87  100.00  27.03  18337  137.051  1948  1724.113  1861.164
    0.00  0.00   99.51  100.00  27.03  18337  137.051  1950  1726.043  1863.094
    0.00  0.00   46.40  100.00  27.02  18337  137.051  1951  1729.225  1866.276
    0.00  0.00  100.00  100.00  27.02  18337  137.051  1953  1730.202  1867.253
    0.00  0.00  100.00  100.00  27.02  18337  137.051  1956  1735.058  1872.110
    0.00  0.00  100.00  100.00  27.02  18337  137.051  1959  1739.521  1876.572
    0.00  0.00  100.00  100.00  27.02  18337  137.051  1962  1743.700  1880.751
    0.00  0.00  100.00  100.00  27.02  18337  137.051  1964  1745.484  1882.535
    0.00  0.00   63.82  100.00  27.02  18338  137.434  1965  1748.976  1886.409
    0.00  0.00   65.91  100.00  27.02  18338  137.434  1967  1751.125  1888.558
    0.00  0.00  100.00  100.00  27.02  18338  137.434  1970  1753.688  1891.122
    0.00  0.00   99.82  100.00  27.02  18338  137.434  1975  1759.191  1896.625
    0.00  0.00   13.42  100.00  27.02  18338  137.434  1977  1762.786  1900.220
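
To put numbers on the jstat sample above, here is a small Scala sketch that compares the first and last rows. It is only an illustration, not part of the streaming job: `JstatSummary`, `Sample`, `parse` and `summarize` are hypothetical names, and the 60-second window assumes the 13 rows shown are consecutive samples at the 5000 ms interval passed to jstat (12 intervals of 5 s).

```scala
// Hypothetical helper for reading the jstat -gcutil output above; not part of the job.
object JstatSummary {
  // Only the counters needed for the summary are kept from each row.
  final case class Sample(ygc: Long, fgc: Long, fgct: Double)

  // Parse one data row with columns: S0 S1 E O P YGC YGCT FGC FGCT GCT.
  def parse(line: String): Sample = {
    val f = line.trim.split("\\s+")
    Sample(ygc = f(5).toLong, fgc = f(7).toLong, fgct = f(8).toDouble)
  }

  // Returns (young GCs, full GCs, fraction of wall time spent in full GC)
  // between two rows that are wallSeconds apart.
  def summarize(first: String, last: String, wallSeconds: Double): (Long, Long, Double) = {
    val (a, b) = (parse(first), parse(last))
    (b.ygc - a.ygc, b.fgc - a.fgc, (b.fgct - a.fgct) / wallSeconds)
  }

  def main(args: Array[String]): Unit = {
    // First and last rows copied from the jstat output in this post.
    val first = "0.00 0.00 98.87 100.00 27.03 18337 137.051 1948 1724.113 1861.164"
    val last  = "0.00 0.00 13.42 100.00 27.02 18338 137.434 1977 1762.786 1900.220"
    // Assumption: 12 intervals of 5 s between the two rows.
    val (young, full, fraction) = summarize(first, last, 12 * 5.0)
    println(s"young GCs: $young, full GCs: $full")
    println(f"share of wall time in full GC: ${fraction * 100}%.1f%%")
  }
}
```

Under that assumption, the sample shows 1 young GC and 29 full GCs in about a minute, with roughly 64% of the wall time spent in full GC while the old generation stays at 100%.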