Data incorrectly written to or read from HDFS via Spark Streaming

Asked: 2018-09-07 16:02:41

Tags: java apache-spark hdfs spark-streaming

I have a Spark Streaming process that consumes streaming data from a Kafka topic into a JavaPairDStream object, and I want to store this data in HDFS. I have tried two approaches; both of them run, but both also give me problems, and I am not sure whether the problem lies in how I write the data to HDFS or in how I read it back.

I am running all of this locally on a Windows machine, using Java.

  1. My first attempt was to iterate over the stream and write each RDD to HDFS with jPairRDD.saveAsTextFile("hdfs://localhost:9000/test/");. Running this produces no errors, but when I try to read the files back from the "test" directory, nothing is returned. When I go into the directory myself and list the files with hdfs dfs -ls /test/, all I see are files like /test/_SUCCESS and /test/part-0000.

  2. My second attempt (and the one currently in use) is jPairDStream.dstream().saveAsTextFiles("hdfs://localhost:9000/test/", "txt");. This works, and I can see the files listed in the "test" directories in HDFS, but when I read them back I get twice as much data as there should be. I found that the reason is that, for some reason, the data is wrapped in parentheses, and when I read it back from HDFS the closing parenthesis ends up on a separate line in the RDD.

In the case of attempt 2, my input data looks like this:

key, <message msgTime="07-9-2018 15:49:13" mountPoint="HKWS_32" msgLength="107" msgType="1115">0wBrRbAAc7iiQgAg8AAAAAAAAAAgAQEAf95g3v/8fXDoL+FAif+0ENAhcEqGww2lGzsbmjfscJgMbwA6sAEI6h8CoIbU4ikng4eYDrYAQGxf/////gDHLS3K+y6sgdkDwiTBXUK5hgj7R/aP5ggAAIA=</message>

and the data read back from HDFS then looks like this:

(key, <message msgTime="07-9-2018 15:49:13" mountPoint="HKWS_32" msgLength="107" msgType="1115">0wBrRbAAc7iiQgAg8AAAAAAAAAAgAQEAf95g3v/8fXDoL+FAif+0ENAhcEqGww2lGzsbmjfscJgMbwA6sAEI6h8CoIbU4ikng4eYDrYAQGxf/////gDHLS3K+y6sgdkDwiTBXUK5hgj7R/aP5ggAAIA=</message>

)

So the RDD gets an extra entry containing ). I have no idea why this is happening, and I cannot tell whether it is a problem with how the data is written to HDFS or with how it is read back. The code for all of my classes is below; it performs both attempts at once. Can anyone help me figure out why this happens?
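
My suspicion is that saveAsTextFiles simply writes each RDD element using its toString(), and for a scala.Tuple2 that is "(" + _1 + "," + _2 + ")". A minimal sketch of that behaviour, assuming the Kafka message value ends with a newline (which I have not confirmed; the payload is shortened here):

import scala.Tuple2;

public class TupleToStringDemo {
    public static void main(String[] args) {
        //Hypothetical record mimicking the Kafka payload; the trailing "\n"
        //on the value is an assumption
        Tuple2<String, String> record = new Tuple2<>("key", "<message ...>0wBrRb...</message>\n");

        //saveAsTextFile(s) writes each element's toString(); for a Tuple2
        //that renders as "(key,value)"
        System.out.print(record);

        //Prints the closing ")" on its own line because of the "\n":
        //(key,<message ...>0wBrRb...</message>
        //)
    }
}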

The Spark class that consumes the streaming data from Kafka and writes it to HDFS:

import java.io.Serializable;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

public class FatStreamProcessing implements Runnable, Serializable {

    private static final long serialVersionUID = 1L;

    private static Logger logger = Logger.getLogger(FatStreamProcessing.class);

    static Map<String, Object> kafkaParams = new HashMap<>();
    private static final String inTopic = "fatTopicIn";
    private static final Streamery streamery = new Streamery();

    @Override
    public void run() {

        //Set logging level for console
        Logger.getLogger("org").setLevel(Level.ERROR);

        //Use all local cores, set the app name, and use a 1-second batch interval
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SampleSparkKafkaStreamApp");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        setupConsumerProperties();

        //Topic the process listens to
        Collection<String> topics = Arrays.asList(inTopic);

        //Create DStream that subscribes to the list of Kafka topics
        final JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        logger.info("Direct Stream created for Fat Stream Consumer");

        //Map kafka input to key value tuple
        JavaPairDStream<String, String> jPairDStream =  stream.mapToPair(
                new PairFunction<ConsumerRecord<String, String>, String, String>() {
                    @Override
                    public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
                        return new Tuple2<>(record.key(), record.value());
                    }
                });

        //Save files to HDFS
        //First attempt (see the Streamery class for the implementation)
        streamery.saveToHDFS(jPairDStream);

        //Second attempt: saveAsTextFiles writes each batch's RDD to a new
        //directory named <prefix>-<time in ms>.<suffix>
        jPairDStream.dstream().saveAsTextFiles("hdfs://localhost:9000/test/", "txt");

        try {
            jssc.start();
            jssc.awaitTermination();
        }
        catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Configure kafka consumer parameters
     * Example taken from KafkaConsumer documentation
     */
    private static void setupConsumerProperties() {
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "test");
        kafkaParams.put("enable.auto.commit", "true");
        kafkaParams.put("auto.commit.interval.ms", "1000");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("key.serializer", StringSerializer.class);
        kafkaParams.put("value.serializer", StringSerializer.class);
    }
}

Ignore the class name; it was originally meant to stream the data on to further Kafka consumers. Now it only stores the data in HDFS.

The class that stores the files in HDFS for the first attempt:

import java.util.HashMap;
import java.util.Map;

import org.apache.log4j.Logger;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class Streamery {

    private static Logger logger = Logger.getLogger(Streamery.class);

    private static Map<String, Object> kafkaParams = new HashMap<>();
    private static int ack_sum = 0;
    private static int brs_sum = 0;
    static int totalInHDFS = 0;

    public Streamery() {

    }

    public void saveToHDFS(JavaPairDStream<String, String> jPairDStream) {
        jPairDStream.foreachRDD(jPairRDD -> {
            jPairRDD.saveAsTextFile("hdfs://localhost:9000/test/");
            totalInHDFS += jPairRDD.count();
            logger.info("Saved " + totalInHDFS + " records to HDFS so far");
        });
    }

    /*
    public void saveToHDFS2(JavaPairDStream<String, String> javaPairDStream) {
        javaPairDStream.foreachRDD(jPairRDD -> {
            jPairRDD.foreach(rdd -> {
                totalInHDFS++;
               System.out.println("RDD conatins: " + rdd._2);
               System.out.println("Total saved to HDFS: " + totalInHDFS);
            });
        });
    }*/
}
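
For reference, a sketch of an extra method for the Streamery class above that writes the pairs as explicit key,value lines instead of relying on Tuple2's toString(); the comma delimiter and the trim() call are assumptions on my part, not code I am running:

    public void saveToHDFSAsLines(JavaPairDStream<String, String> jPairDStream) {
        //Format each pair explicitly so the stored text contains no tuple parentheses
        jPairDStream
                .map(pair -> pair._1() + "," + pair._2().trim())
                .dstream()
                .saveAsTextFiles("hdfs://localhost:9000/test/", "txt");
    }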

The class used to read the data back from HDFS:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadHDFS {

    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("dd-M-yyyy HH:mm:ss");
    private static final SimpleDateFormat FILE_FORMAT = new SimpleDateFormat("yyyyMdd_HHmmss");

    static int count = 0;

    public static void main(String[] args) throws IOException {

        //FileWriter fileWriter = new FileWriter("hdfsMessages" + FILE_FORMAT.format(new Date()) + ".xml");

        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("TempReadHDFS");

        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sc.textFile("hdfs://localhost:9000/test/*");


        List<String> temp = textFile.collect();

        temp.forEach(s -> System.out.println(s));

        System.out.println("Total number of files in HDFS: " + textFile.count());

    }
}
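
A small diagnostic along these lines could confirm whether the doubling comes from the stray closing parentheses (the assumption that those lines contain only ")" is mine):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountStrayParens {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("CountStrayParens");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/test/*");

        //If this is roughly half of the total, the "extra" data is just the
        //stray ")" lines produced by the Tuple2 rendering
        long stray = lines.filter(line -> line.trim().equals(")")).count();

        System.out.println("Total lines: " + lines.count());
        System.out.println("Lines containing only ')': " + stray);

        sc.stop();
    }
}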

0 Answers