Using the Java API, I have written a Spark Streaming application that processes and prints the results correctly. Now I want to write the results to HDFS. The versions are:
Hadoop 2.7.3
Spark 2.2.0
Java 1.8
Here is the code:
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class Spark {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("Spark Streaming").setMaster("local[*]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

        // Kafka consumer configuration
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "use");
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("testStr");

        // Direct stream from Kafka
        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)
            );

        // Convert each record and save the results as text files
        stream.map(record -> finall(record.value()))
              .map(record -> Arrays.deepToString(record))
              .dstream()
              .saveAsTextFiles("spark", "txt");

        ssc.start();
        ssc.awaitTermination();
    }

    // Converts the raw bytes into printable, decimal, octal and binary representations
    public static String[][] finall(byte[] record) {
        String[][] result = new String[4][];
        result[0] = javaTest.bytePrintable(record);
        result[1] = javaTest.hexTodecimal(record);
        result[2] = javaTest.hexToOctal(record);
        result[3] = javaTest.hexTobin(record);
        return result;
    }
}
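For context, when I say I want to write to HDFS, the save step I have in mind looks roughly like the sketch below, using a fully qualified HDFS prefix instead of the relative "spark" prefix. The namenode host, port and output directory are placeholders for my cluster, not something I have tested:

// Same pipeline as above, but with an explicit hdfs:// prefix (placeholder address)
stream.map(record -> finall(record.value()))
      .map(record -> Arrays.deepToString(record))
      .dstream()
      .saveAsTextFiles("hdfs://namenode:9000/user/spark/output/spark", "txt");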
But it does not work with either HDFS or the local file system; instead I get this error:
ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()Lorg/apache/hadoop/fs/FileSystem$Statistics$StatisticsData;
What is the problem? Do I need to import some libraries from Hadoop?
UPDATE
I used the local Spark jars instead of the Maven dependencies, and it worked, so something in the dependencies must be wrong. Here is the POM.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
Which one is incompatible? Or is something missing?
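In case it helps, the change I was considering is pinning the Hadoop client libraries to my cluster version, on the assumption that spark-core 2.2.0 pulls in an older hadoop-client transitively (I have not verified which version it actually brings in):

<!-- Assumption: force the Hadoop client jars to match the 2.7.3 cluster -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>

Would that be the right way to align the versions, or is the problem elsewhere in the dependencies?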