I am trying to test a Kafka stream with the code below, using broker version 0.10. It is just a simple script that prints the topic's contents to the console, nothing fancy! Yet for some reason it runs out of memory (the VM has 10 GB of RAM). Code:
# coding: utf-8
"""
kafka-test-003.py: test with broker 0.10 (new Spark Structured Streaming API)
How to run this script?
spark-submit --jars jars/spark-sql-kafka-0-10_2.11-2.3.0.jar,jars/kafka-clients-0.11.0.0.jar kafka-test-003.py
"""
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col
# starting spark session
spark = SparkSession.builder.appName("Kafka-test").getOrCreate()
spark.sparkContext.setLogLevel('WARN')
broker = "kafka.some.address:9092"
topic = "my.topic"
### Streaming
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", broker) \
.option("startingOffsets", "earliest") \
.option("subscribe", topic) \
.load() \
.select(col('key').cast("string"),col('value').cast("string"))
query = df \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
### End Streaming
query.awaitTermination()
Running spark-submit:
spark-submit --master local[*] --driver-memory 5G --executor-memory 5G --jars jars/kafka-clients-0.11.0.0.jar,jars/spark-sql-kafka-0-10_2.11-2.3.0.jar kafka-test-003.py
Unfortunately, the result is:
java.lang.OutOfMemoryError: Java heap space
I assumed Kafka was supposed to deliver small amounts of data at a time, precisely to avoid this kind of problem. So, what am I doing wrong?
Answer 0 (score: 0)
Spark memory management is a complex process. The best configuration depends not only on your data and the type of operations, but also on overall system behavior. Could you retry with the following spark-submit command?
spark-submit --master local[*] --driver-memory 4G --executor-memory 2G --executor-cores 5 --num-executors 8 --jars jars/kafka-clients-0.11.0.0.jar,jars/spark-sql-kafka-0-10_2.11-2.3.0.jar kafka-test-003.py
You could also experiment with tuning the memory parameters above; the following related question explains how these options behave: Using spark-submit, what is the behavior of the --total-executor-cores option?
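One more thing worth noting: because the job uses `startingOffsets = earliest`, the very first micro-batch can try to pull the topic's entire history at once, which on its own can exhaust the heap regardless of the spark-submit memory flags. A sketch of how to cap the batch size with the Kafka source's `maxOffsetsPerTrigger` option (the option is part of the spark-sql-kafka-0-10 source; the limit of 10000 is purely illustrative, and `spark`, `broker`, and `topic` are the variables from the question's script):

```python
# Cap how many Kafka records each micro-batch pulls, so the first batch
# (which starts from the earliest offset) cannot flood the driver heap.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```

With this limit in place, Spark still reads the whole topic from the beginning, but spread across many small batches instead of one huge one.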