I am using Spark to consume from Kafka. The Kafka producer sends the data in JSON format, but the JSON contains null values. To replace the nulls with default values, I use Spark SQL.
In other words, I read the JSON data from the RDD that Spark gives me, load it into a Dataset, process the Dataset with Spark SQL, and keep the SQL result as a new Dataset. To turn that final Dataset into a Java List, I first convert it back into a new RDD whose element type is a custom class, and the last step collects that RDD to fill the list. During that collect I get the following error:
org.apache.kafka.clients.consumer.ConsumerRecord cannot be cast to java.lang.String
Here is my code:
public void start(String[] args) throws Exception {
final String brokers = args[0];
final String topic = args[1];
SentimentAnalyse.serviceURL = args[2].trim();
final String hbaseTableName = args[3].trim();
final String zk = args[4].trim();
final String consumerId = args[5].trim();
Long batchInterval = Long.parseLong(args[6].trim());
Long windowInterval = Long.parseLong(args[7].trim());
JavaStreamingContext ssc = Streaming.createContext4SM(batchInterval, windowInterval, brokers, topic, hbaseTableName, zk, consumerId);
ssc.start();
ssc.awaitTermination();
}
public class Streaming implements Serializable {
public static JavaStreamingContext createContext4SM(Long batchInterval, Long windowInterval, String _brokers,
String _topic, final String hbaseTableName, final String _zk, final String _consumerID) {
SparkConf sparkConf = new SparkConf().setAppName("SMMwithAlerts")
.set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(60));
String brokers = _brokers;
String topics = _topic;
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("serializer.encoding", "windows-1254");
kafkaParams.put("group.id", _consumerID);
// parameters 4 secure kafka (SSL+KERBEROS)
kafkaParams.put("bootstrap.servers", brokers);
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("security.protocol", "SASL_SSL");
kafkaParams.put("ssl.keystore.type", "JKS");
kafkaParams.put("ssl.truststore.type", "JKS");
kafkaParams.put("ssl.enabled.protocols", "TLSv1.2,TLSv1.1,TLSv1");
kafkaParams.put("ssl.truststore.location", "/tmp/kafkaKeystore");
kafkaParams.put("ssl.truststore.password", "hardrock");
kafkaParams.put("ssl.secure.random.implementation", "SHA1PRNG");
kafkaParams.put("sasl.kerberos.service.name", "kafka");
System.setProperty("java.security.auth.login.config", "/tmp/jaas.conf");
JavaInputDStream<ConsumerRecord<String, String>> kafkaMessages = null;
try {
Map<TopicPartition, Long> offsets = KafkaOffsetManagment.getKafkaOffsetFromZK4Spark2(_zk, _consumerID,
topics);
Collection<String> currentTopic = Arrays.asList(topics);
ConsumerStrategy<String, String> gtConsumerStrategy = ConsumerStrategies.Subscribe(currentTopic,
kafkaParams);
kafkaMessages = KafkaUtils.createDirectStream((JavaStreamingContext) jssc,
(LocationStrategy) LocationStrategies.PreferConsistent(), gtConsumerStrategy);
kafkaMessages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
@Override
public void call(JavaRDD<ConsumerRecord<String, String>> rdd) throws Exception {
rdd.collect().forEach(new java.util.function.Consumer<ConsumerRecord<String, String>>() {
@Override
public void accept(ConsumerRecord<String, String> consumerRecord) {
String maInput = "";
maInput = consumerRecord.value();
System.out.println("----- value : " + maInput);
}
});
List<SMMContent> sasInput = new ArrayList<>();
Dataset<Row> resdf = null;
try{
SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
Dataset<Row> df = sqlContext.read().schema(Schemas4SocialMedia.getSmmJsonSchema()).json(((org.apache.spark.api.java.JavaRDD) rdd));
df.registerTempTable("smmContent");
resdf = sqlContext.sql("select CASE WHEN ID IS NULL THEN '' ELSE ID END AS ID,CASE WHEN ID_FROM_CHANNEL IS NULL THEN '' ELSE ID_FROM_CHANNEL END AS ID_FROM_CHANNEL FROM smmContent");
}
catch(Exception ex)
{
System.err.println("Error");
System.exit(1);
}
JavaRDD<SMMContent> smmData = resdf
.toJavaRDD()
.map(new Function<Row, SMMContent>() {
public SMMContent call(Row row) throws JSONException, UnsupportedEncodingException {
return new SMMContent(row);
}
});
// ### the code works successfully until here###
//alternative-1 the code below fails with org.apache.kafka.clients.consumer.ConsumerRecord cannot be cast to java.lang.String
sasInput = smmData.collect();
//alternative-2 the code below fails with org.apache.kafka.clients.consumer.ConsumerRecord cannot be cast to java.lang.String
smmData.collect().forEach( (newVal) -> {
sasInput.add(newVal);
System.out.println("-+-+-+baris : " + newVal.getAUTO_HYS());
});
//alternative-3 the code below fails with org.apache.kafka.clients.consumer.ConsumerRecord cannot be cast to java.lang.String
smmData.collect().forEach(new java.util.function.Consumer<SMMContent>() {
@Override
public void accept(SMMContent currSmm) {
String maInput = "";
maInput = currSmm.getCHANNEL_CATEGORY_ID();
System.out.println("----- QQQQQ : " + maInput);
sasInput.add(currSmm);
}
});
}
});
} catch (Exception ex) {
System.err.println(ErrorMessages.getStackTrace(ErrorMessages.kafkaConsumeErr, ex));
System.exit(1);
}
return jssc;
}
}
As I noted in the comments in the code, everything runs successfully up to the point where I collect the RDD created from the Dataset that Spark SQL produced. But as soon as that RDD is collected, it throws the error mentioned above. What am I missing?
I have tried three different ways of getting the data out of the RDD, and all of them return the same error. The strangest part of the error is that it claims I am working with ConsumerRecord in the RDD. In fact, the only RDD that holds ConsumerRecord objects is the first one created by Spark while consuming from Kafka; the second RDD stores my custom Java class SMMContent. Yet the error says otherwise.
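For reference, below is a minimal sketch of the direction I suspect (an assumption on my part, not verified): mapping the ConsumerRecord RDD to its String values before handing it to the JSON reader, instead of pushing the raw RDD through the unchecked cast. It would sit inside the foreachRDD call and reuses Schemas4SocialMedia.getSmmJsonSchema() from the code above; since Spark evaluates transformations lazily, I assume a wrong element type in the source RDD would only surface later, at collect().
// Sketch (assumption): extract the JSON payloads first, so the JSON reader sees a JavaRDD<String>
JavaRDD<String> jsonLines = rdd.map(new Function<ConsumerRecord<String, String>, String>() {
    @Override
    public String call(ConsumerRecord<String, String> record) throws Exception {
        return record.value(); // the JSON string sent by the producer
    }
});
SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
Dataset<Row> df = sqlContext.read()
        .schema(Schemas4SocialMedia.getSmmJsonSchema())
        .json(jsonLines); // json(JavaRDD<String>) overload available in Spark 2.2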
Versions
Spark version is 2.2, Kafka version is 0.10, Java version is 1.8.
Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>