I am developing a Spark Structured Streaming application with Kafka. It works fine except for one thing: Spark keeps resetting the offsets of all partitions to X, which burns a lot of network I/O and CPU. If I add more Kafka consumers, the CPU cost becomes noticeable: nearly 25% CPU usage while the job is idle. Is this normal behaviour, or am I missing some configuration?
I created a minimal Spark Kafka consumer application to demonstrate the issue.
This is the log from a fresh start, sitting idle:
19/12/11 16:24:29 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-b2457938-d427-47e2-b90a-7c6f0d85904b--1563005380-driver-0] Resetting offset for partition testJson-0 to offset 2.
19/12/11 16:24:29 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-b2457938-d427-47e2-b90a-7c6f0d85904b--1563005380-driver-0] Resetting offset for partition testJson-0 to offset 2.
19/12/11 16:24:29 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-b2457938-d427-47e2-b90a-7c6f0d85904b--1563005380-driver-0] Resetting offset for partition testJson-0 to offset 2.
19/12/11 16:24:30 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-b2457938-d427-47e2-b90a-7c6f0d85904b--1563005380-driver-0] Resetting offset for partition testJson-0 to offset 2.
19/12/11 16:24:30 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-b2457938-d427-47e2-b90a-7c6f0d85904b--1563005380-driver-0] Resetting offset for partition testJson-0 to offset 2.
... (this exact "Resetting offset for partition testJson-0 to offset 2" line repeats dozens of times per second, indefinitely, even though no data is arriving)
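As an aside, I know these messages come from the Kafka consumer's Fetcher logger, so I can hide them by raising that logger's level. A sketch of that, assuming Spark's bundled log4j 1.x and placed at the start of main(), is below; it only mutes the output, the constant polling and CPU usage stay the same.

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Mutes only the log flood from the Kafka Fetcher class shown above;
// the underlying offset polling is unchanged. Assumes log4j 1.x on the classpath.
Logger.getLogger("org.apache.kafka.clients.consumer.internals.Fetcher").setLevel(Level.WARN);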
Project
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;
public class Main {

    public Main() throws StreamingQueryException {
        // Schema of the JSON payload carried in the Kafka message value
        StructType schema = DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField("id", DataTypes.LongType, false),
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("value", DataTypes.LongType, false),
        });

        SparkConf conf = new SparkConf(true)
                .setMaster("local[1]")
                .set("spark.default.parallelism", "1")
                .setAppName("spark-kafka-demo1");

        JavaSparkContext context = new JavaSparkContext(conf);

        SparkSession session = SparkSession
                .builder()
                .config(conf)
                .sparkContext(context.sc())
                .appName("spark-kafka-demo1")
                .getOrCreate();

        // Read the Kafka topic and parse the message value as JSON
        Dataset<Row> dataset = session
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "192.168.0.201:9092")
                .option("subscribe", "testJson")
                .load()
                .selectExpr("CAST(value AS STRING) as message")
                .select(from_json(col("message"), schema).as("t"));

        // Aggregate per name and print every updated row to the console
        StreamingQuery query = dataset
                .groupBy("t.name")
                .agg(sum("t.value"))
                .writeStream()
                .format("console")
                .option("truncate", false)
                .outputMode(OutputMode.Update())
                .start();

        query.awaitTermination();
    }

    public static void main(String[] args) throws StreamingQueryException {
        new Main();
    }
}
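One thing I am unsure about is the trigger: the query above uses the default trigger, which as far as I understand makes the driver plan micro-batches (and poll Kafka for the latest offsets) essentially back to back while the topic is idle. The only variation I can think of is an explicit processing-time trigger; a sketch of that change is below, using org.apache.spark.sql.streaming.Trigger. I have not confirmed whether it actually reduces the offset resets or the CPU usage.

import org.apache.spark.sql.streaming.Trigger;

// Sketch: same query as above, but micro-batches are planned every 10 seconds
// instead of continuously. Unverified whether this reduces the idle polling.
StreamingQuery query = dataset
        .groupBy("t.name")
        .agg(sum("t.value"))
        .writeStream()
        .format("console")
        .option("truncate", false)
        .outputMode(OutputMode.Update())
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .start();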
Maven dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.4</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.4.4</version>
</dependency>