I'm new to the stream processing space and am trying to do the following with Spark Structured Streaming -
Now, I have already tried this on Apache Flink and it worked, but I don't know how to do it in Spark. The only way to do it seems to be through the flatMapGroupsWithState operation. However, that requires a grouping to be performed first, and I have no reason to group the data.
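To make concrete what I mean, forcing my pipeline through flatMapGroupsWithState would look roughly like the sketch below, applied to the eventStream from my job further down. Every record is mapped onto one artificial constant key just so the API becomes callable; the key "all", the Long state type, and the pass-through logic are all placeholders, not anything I actually need:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.function.FlatMapGroupsWithStateFunction;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.streaming.GroupStateTimeout;
    import org.apache.spark.sql.streaming.OutputMode;

    // Hypothetical: funnel every record onto one constant key
    Dataset<String> stateful = eventStream
            .groupByKey((MapFunction<String, String>) e -> "all", Encoders.STRING())
            .flatMapGroupsWithState(
                    (FlatMapGroupsWithStateFunction<String, String, Long, String>) (key, events, state) -> {
                        // Stateful logic would live here; this just passes events through
                        List<String> out = new ArrayList<>();
                        while (events.hasNext()) {
                            out.add(events.next());
                        }
                        return out.iterator();
                    },
                    OutputMode.Append(),
                    Encoders.LONG(),
                    Encoders.STRING(),
                    GroupStateTimeout.NoTimeout());

Apart from being artificial, a single constant key would also funnel all records through one task, which defeats the parallelism, so I would rather avoid this.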
My job looks like this -
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.DataStreamWriter;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class KafkaJob {
    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder().appName("KafkaJob1").getOrCreate();
        // Read from Kafka and project key/value as strings
        Dataset<Row> inputStream = spark.readStream().format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic").load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
        // Split each raw record into individual events
        Dataset<String> eventStream = inputStream.flatMap(new EventSplitter(), Encoders.STRING());
        // Write each event out through a custom HTTP sink
        DataStreamWriter<String> streamWriter = eventStream.writeStream().foreach(new HttpSink());
        StreamingQuery query = streamWriter.start();
        query.awaitTermination();
    }
}
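The HttpSink in the foreach call is a custom ForeachWriter<String> that POSTs each event over HTTP. The full class isn't important to the question; a minimal sketch of such a sink (the endpoint URL and the bare-bones error handling are placeholders) would be:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import org.apache.spark.sql.ForeachWriter;

    public class HttpSink extends ForeachWriter<String> {
        private static final long serialVersionUID = 1L;
        // Placeholder endpoint; the real one is configured elsewhere
        private static final String ENDPOINT = "http://localhost:8080/events";

        @Override
        public boolean open(long partitionId, long epochId) {
            return true; // no per-partition setup needed
        }

        @Override
        public void process(String event) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                try (OutputStream os = conn.getOutputStream()) {
                    os.write(event.getBytes(StandardCharsets.UTF_8));
                }
                conn.getResponseCode(); // force the request; response body is ignored
                conn.disconnect();
            } catch (Exception e) {
                throw new RuntimeException("Failed to POST event", e);
            }
        }

        @Override
        public void close(Throwable errorOrNull) {
            // nothing to clean up
        }
    }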
My flatMap function is this -
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Row;

public class EventSplitter implements FlatMapFunction<Row, String> {
    private static final long serialVersionUID = -7835625460262333452L;

    @Override
    public Iterator<String> call(Row rawEventRow) throws Exception {
        List<String> eventList = Collections.emptyList();
        // Column 1 is "value" after the selectExpr("key", "value") projection
        String rawEvent = rawEventRow.getString(1);
        if (rawEvent != null && !rawEvent.isEmpty()) {
            // Note: println output ends up in the executor logs, not the driver console (except in local mode)
            System.out.println("EventSplitter received event with length " + rawEvent.length());
            String[] eventArr = rawEvent.split("SOME DELIMITER");
            if (eventArr.length > 0) { // split() never returns null
                eventList = Arrays.asList(eventArr);
            }
        }
        System.out.println("EventSplitter generated " + eventList.size() + " events");
        return eventList.iterator();
    }
}