I have a Spark Streaming process that consumes streaming data from a Kafka topic into a JavaPairDStream object, and I want to store this data in HDFS. I have tried two approaches; both work, but both also give me problems, and I cannot tell whether the problem lies in writing the data to HDFS or in reading it back out.
I am running all of this locally on a Windows machine, in Java.
My first attempt was to iterate over the stream and write each RDD to HDFS with jPairRDD.saveAsTextFile("hdfs://localhost:9000/test/");. This runs without any errors, but when I try to read the files back from the "test" directory, nothing comes back. When I go into the directory myself and list it with hdfs dfs -ls /test/, all I see are files like /test/_SUCCESS and /test/part-0000.
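In case it helps, here is a variation of the first attempt that I have been considering but have not actually verified: writing each micro-batch to its own directory instead of always to /test/, in case successive batches are overwriting each other's output. The batch-<time> suffix is just something I made up for illustration; this method would sit alongside saveToHDFS in the Streamery class shown further down.

//Untested sketch: write each micro-batch to its own directory so that
//successive batches cannot clobber the output already sitting under /test/.
public void saveToHDFSPerBatch(JavaPairDStream<String, String> jPairDStream) {
    jPairDStream.foreachRDD((jPairRDD, time) -> {
        //time is the batch time of this micro-batch
        jPairRDD.saveAsTextFile("hdfs://localhost:9000/test/batch-" + time.milliseconds());
    });
}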
My second attempt (and the one I am currently using) is jPairDStream.dstream().saveAsTextFiles("hdfs://localhost:9000/test/", "txt");. This works, and I can see the files listed in the "test" directory in HDFS, but when I read them back I get twice as many entries as I should. I found that the reason is that, for some reason, the data has parentheses around it, and when I read it back from HDFS the closing parenthesis ends up on its own line in the RDD.
In the case of the second attempt, my input data looks like this:
key, <message msgTime="07-9-2018 15:49:13" mountPoint="HKWS_32" msgLength="107" msgType="1115">0wBrRbAAc7iiQgAg8AAAAAAAAAAgAQEAf95g3v/8fXDoL+FAif/0ENAhcEqGww2lGzsbmjfscJgMbwA6sAEI6h8CoIbU4ikng4eYDrYAQGxf/////gDHLS3K+y6sgdkDwiTBXUK5hgj7R/aP5ggAAIA=</message>
and the data that comes back from HDFS looks like this:
(key, <message msgTime="07-9-2018 15:49:13" mountPoint="HKWS_32" msgLength="107" msgType="1115">0wBrRbAAc7iiQgAg8AAAAAAAAAAgAQEAf95g3v/8fXDoL+FAif/0ENAhcEqGww2lGzsbmjfscJgMbwA6sAEI6h8CoIbU4ikng4eYDrYAQGxf/////gDHLS3K+y6sgdkDwiTBXUK5hgj7R/aP5ggAAIA=</message>
)
So the RDD contains an extra entry that is just ). I have no idea why this is happening, and I cannot tell whether it is a problem with how I write the data to HDFS or with how I read it back. The code for all of my classes is below. Can anyone help me figure out why this happens? The code below includes both attempts.
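Before the full classes, here is one other variation I have been thinking about but have not verified: building the output line myself before saving, instead of letting the pair be written in its default text form, and trimming the value in case it carries a trailing newline. The key,value line format here is purely my own choice for illustration.

//Untested sketch: turn each pair into a plain string before saving, so the
//output is not wrapped in the default "(key,value)" form, and trim the value
//in case it ends with a newline.
JavaDStream<String> lines = jPairDStream.map(
        tuple -> tuple._1() + "," + tuple._2().trim());
lines.dstream().saveAsTextFiles("hdfs://localhost:9000/test/", "txt");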
Spark class that consumes the streaming data from Kafka and writes it to HDFS:
public class FatStreamProcessing implements Runnable, Serializable {

    private static final long serialVersionUID = 1L;
    private static Logger logger = Logger.getLogger(FatStreamProcessing.class);

    static Map<String, Object> kafkaParams = new HashMap<>();
    private static final String inTopic = "fatTopicIn";
    private static final Streamery streamery = new Streamery();

    @Override
    public void run() {
        //Set logging level for console
        Logger.getLogger("org").setLevel(Level.ERROR);

        //Set spark context to use all cores and batch interval of 1 second and the job name
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SampleSparkKafkaStreamApp");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        setupConsumerProperties();

        //Topic the process listens to
        Collection<String> topics = Arrays.asList(inTopic);

        //Create DStream that subscribes to the list of Kafka topics
        final JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        logger.info("Direct Stream created for Fat Stream Consumer");

        //Map kafka input to key value tuple
        JavaPairDStream<String, String> jPairDStream = stream.mapToPair(
                new PairFunction<ConsumerRecord<String, String>, String, String>() {
                    @Override
                    public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
                        return new Tuple2<>(record.key(), record.value());
                    }
                });

        //Save files to hdfs
        //This is the first attempt (see Streamery class for implementation)
        streamery.saveToHDFS(jPairDStream);

        //This is the second attempt
        jPairDStream.dstream().saveAsTextFiles("hdfs://localhost:9000/test/", "txt");

        try {
            jssc.start();
            jssc.awaitTermination();
        }
        catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Configure kafka consumer parameters
     * Example taken from KafkaConsumer documentation
     */
    private static void setupConsumerProperties() {
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "test");
        kafkaParams.put("enable.auto.commit", "true");
        kafkaParams.put("auto.commit.interval.ms", "1000");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("key.serializer", StringSerializer.class);
        kafkaParams.put("value.serializer", StringSerializer.class);
    }
}
Ignore the class name; it was originally intended to stream the data on to more Kafka consumers. For now it only stores the data to HDFS.
Class used to store the files in HDFS for the first attempt:
public class Streamery {

    private static Logger logger = Logger.getLogger(Streamery.class);
    private static Map<String, Object> kafkaParams = new HashMap<>();
    private static int ack_sum = 0;
    private static int brs_sum = 0;
    static int totalInHDFS = 0;

    public Streamery() {
    }

    public void saveToHDFS(JavaPairDStream<String, String> jPairDStream) {
        jPairDStream.foreachRDD(jPairRDD -> {
            jPairRDD.saveAsTextFile("hdfs://localhost:9000/test/");
            totalInHDFS += jPairRDD.count();
            logger.info("Saved " + totalInHDFS + " files to hdfs");
        });
    }

    /*
    public void saveToHDFS2(JavaPairDStream<String, String> javaPairDStream) {
        javaPairDStream.foreachRDD(jPairRDD -> {
            jPairRDD.foreach(rdd -> {
                totalInHDFS++;
                System.out.println("RDD contains: " + rdd._2);
                System.out.println("Total saved to HDFS: " + totalInHDFS);
            });
        });
    }*/
}
Class used to read the data back from HDFS:
public class ReadHDFS {

    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("dd-M-yyyy HH:mm:ss");
    private static final SimpleDateFormat FILE_FORMAT = new SimpleDateFormat("yyyyMdd_HHmmss");
    static int count = 0;

    public static void main(String[] args) throws IOException {
        //FileWriter fileWriter = new FileWriter("hdfsMessages" + FILE_FORMAT.format(new Date()) + ".xml");
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("TempReadHDFS");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> textFile = sc.textFile("hdfs://localhost:9000/test/*");
        List<String> temp = textFile.collect();
        temp.forEach(s -> System.out.println(s));

        System.out.println("Total number of files in HDFS: " + textFile.count());
    }
}
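For reference, the only way I have found so far to get clean data back on the read side is the workaround below, which I would rather not keep. It is based purely on the output shown above (drop the lines that are only a closing parenthesis and strip the leading one), so it is a band-aid rather than a fix.

//Workaround sketch, based only on the output shown above:
//drop the lines that are just ")" and strip the leading "(".
JavaRDD<String> cleaned = textFile
        .filter(line -> !line.trim().equals(")"))
        .map(line -> line.startsWith("(") ? line.substring(1) : line);
cleaned.collect().forEach(System.out::println);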