Memory leak when using Spark Streaming and Spark SQL with Kafka in Spark 1.4.1

Date: 2016-02-16 07:18:32

Tags: apache-spark apache-spark-sql

When I use Spark Streaming and Spark SQL with Kafka, my code runs a lot of SQL queries per batch. Watching GC activity on the Spark driver, I see the old generation grow very quickly and full GCs become very frequent, until memory is exhausted and the driver shuts down. Analyzing the heap dump, I found a huge number of org.apache.spark.sql.columnar.ColumnBuilder objects occupying about 90% of the space, retained through HeapByteBuffer instances. I don't understand why these objects are never released and just sit there waiting for GC.

In my cluster I give the driver 2 GB of memory. If I want the streaming job to run longer, my only option is to keep increasing the driver memory, but I don't think that is a reasonable approach: memory eventually fills up and cannot be reclaimed, and the only remedy is to restart the driver program. Can anyone tell me why this happens? Is there a problem in my code, or is it something else?

My code is here:


    import kafka.serializer.StringDecoder

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object LogAnalyzerStreamingSQL {

      def main(args: Array[String]) {
        val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
        val sc = new SparkContext(sparkConf)

        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // 30-second batch interval
        val ssc = new StreamingContext(sc, Seconds(30))
        val topicSet = Set("applogs")
        val kafkaParams = Map[String, String](
          "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
          "group.id" -> "app_group",
          "serializer.class" -> "kafka.serializer.StringEncoder")
        val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicSet)

        kafkaStream.foreachRDD(rdd => {
          if (!rdd.isEmpty()) {
            val jsonRdd = rdd.map(x => x._2)
            val df = sqlContext.read.json(jsonRdd)
            df.registerTempTable("applogs")
            sqlContext.cacheTable("applogs")

            // Calculate statistics based on the content size.
            sqlContext
              .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
              .show()

            // Compute Response Code to Count.
            val responseCodeToCount = sqlContext
              .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
              .map(row => (row.getInt(0), row.getLong(1)))
              .collect()

            // Any IPAddress that has accessed the server more than 10 times.
            val ipAddresses = sqlContext
              .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
              .map(row => row.getString(0))
              .take(100)

            val topEndpoints = sqlContext
              .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")
              .map(row => (row.getString(0), row.getLong(1)))
              .collect()

            // ...a lot of SQL queries like these

            sqlContext.uncacheTable("applogs")
          }
        })

        ssc.start()
        ssc.awaitTermination()
      }
    }
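One thing I notice in my own code: if any of the queries inside the batch throws, `uncacheTable("applogs")` is never reached, so that batch's in-memory columnar buffers (the ColumnBuilder/HeapByteBuffer objects in the heap dump) would never be dropped. A minimal sketch of the same loop with a guard, using the same APIs as above (`runBatchQueries` is a hypothetical stand-in for all the SQL statements):

    // Sketch: guard uncacheTable with try/finally so the cached columnar
    // buffers are released even when one of the queries throws.
    // runBatchQueries is a hypothetical helper standing in for the SQL above.
    kafkaStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd.map(_._2))
        df.registerTempTable("applogs")
        sqlContext.cacheTable("applogs")
        try {
          runBatchQueries(sqlContext)   // all the SELECT statements from above
        } finally {
          sqlContext.uncacheTable("applogs")
          sqlContext.dropTempTable("applogs")
        }
      }
    })

I don't know whether this alone explains the steady old-generation growth, since the queries normally complete without errors.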

Heap dump analysis screenshots: [image 2], [image 3] (not reproduced here)

jstat info:

     $ ./jstat -gcutil 48004 5000
      S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
      0.00   0.00  98.87 100.00  27.03  18337  137.051  1948 1724.113 1861.164
      0.00   0.00  99.51 100.00  27.03  18337  137.051  1950 1726.043 1863.094
      0.00   0.00  46.40 100.00  27.02  18337  137.051  1951 1729.225 1866.276
      0.00   0.00 100.00 100.00  27.02  18337  137.051  1953 1730.202 1867.253
      0.00   0.00 100.00 100.00  27.02  18337  137.051  1956 1735.058 1872.110
      0.00   0.00 100.00 100.00  27.02  18337  137.051  1959 1739.521 1876.572
      0.00   0.00 100.00 100.00  27.02  18337  137.051  1962 1743.700 1880.751
      0.00   0.00 100.00 100.00  27.02  18337  137.051  1964 1745.484 1882.535
      0.00   0.00  63.82 100.00  27.02  18338  137.434  1965 1748.976 1886.409
      0.00   0.00  65.91 100.00  27.02  18338  137.434  1967 1751.125 1888.558
      0.00   0.00 100.00 100.00  27.02  18338  137.434  1970 1753.688 1891.122
      0.00   0.00  99.82 100.00  27.02  18338  137.434  1975 1759.191 1896.625
      0.00   0.00  13.42 100.00  27.02  18338  137.434  1977 1762.786 1900.220
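For reference, in the `jstat -gcutil` output `O` is old-generation occupancy (pinned at 100.00%) and `FGC`/`FGCT` are the full-GC count and accumulated time, which climb in every sample. To capture more detail from the driver JVM, GC logging and an on-OOM heap dump can be enabled; a minimal sketch using standard HotSpot flags (the paths are examples, and in practice these options have to be passed via `spark-submit --conf` because the driver JVM is already running by the time a programmatic SparkConf takes effect):

    // Sketch: standard HotSpot diagnostics for the driver JVM.
    // Normally passed on the command line as
    //   --conf "spark.driver.extraJavaOptions=..."
    // since setting it here is too late for the driver in client mode.
    val diagConf = new SparkConf()
      .setAppName("Log Analyzer Streaming in Scala")
      .set("spark.driver.extraJavaOptions",
        "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/driver-gc.log " +
        "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp")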

0 Answers:

No answers yet.