Reading a JSON file from HDFS with Spark Streaming (Java)

Time: 2018-04-06 09:45:47

Tags: java json apache-spark spark-streaming

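I want to read a JSON file from an HDFS directory with a Spark Streaming job.

I am using the fileStream method provided by JavaStreamingContext, with TextInputFormat as the input format, because each of my JSON files will be a single-line string.

With TextInputFormat I expect to read each value as a String and then create a Dataset from it with the read().json() method.

Below is the code I am trying to get working.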

    SparkConf config = new SparkConf().setAppName("HDFS Streaming Job");
    // Batch interval of 10 seconds
    JavaStreamingContext jsc = new JavaStreamingContext(config, new Duration(10000));

    JavaPairInputDStream<LongWritable, Text> fileLines = null;
    // new line to be more readable!
    // args[0] is the HDFS directory to monitor for new files
    fileLines = jsc.fileStream(args[0],
                               LongWritable.class,
                               Text.class,
                               TextInputFormat.class);

    // Keep only the text of each line; the key is the byte offset within the file
    JavaDStream<String> dstream = fileLines.map(line -> line._2.toString());

    dstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(JavaRDD<String> rdd) {
            // Note: rowRDD is built here but never used below
            JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Row call(String msg) {
                    Row row = RowFactory.create(msg);
                    return row;
                }
            });

            SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
            // Parse each String record of this batch as one JSON document
            Dataset<Row> jsonDataset = spark.read().json(rdd);
            jsonDataset.show();
        }
    });

When I try to read the JSON, the Dataset is created with a _corrupt_record column instead of the parsed fields.

Log output:

++
||
++
++

+--------------------+
|     _corrupt_record|
+--------------------+
|T",  "generatedAt...|
|_Contract",  "gen...|
|Code": "SUCCESS",...|
+--------------------+

++
||
++
++

Sample JSON:

{"pr":"ALERT","ga":"09809","ci":"NMIX","wfid":"WF","spi":"S_01","ssi":"","pi":"01","mi":"01","nea":{"hex":[{"si":"945d1","ni":"01","hi":"01","pi":"K9","et":"MODULE","hei":8798,"bn":""}],"fn":[{"si":"","ni":"","hi":"HH_001","pt":"K9","st":"","fni":"","fnn":"","mc":"","mp":"","dc":""}],"sx":[{"sni":"5d1","ni":"","hi":"","pi":"K9","et":"","st":"","sv":"10","si":"","bh":""}],"ad":[{"sni":"945d1","ni":"01","hi":"001","pi":"K9","st":"","pt":"","sv":"10","pui":" ","hn":"ah","mc":"HIGH","mr":"","pri":""}]}}
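One thing that may be worth checking, given the single-line assumption stated above: TextInputFormat produces one record per line of the file, so a record is a complete JSON document only if the whole document really sits on one line. The fragments in the log do not start with `{`, which would be consistent with records that begin mid-document, though that is a guess and not a confirmed diagnosis. The plain-Java sketch below (no Spark involved; the class name `LineRecordDemo`, its methods, and the shortened sample strings are all hypothetical, and the brace check is only a rough stand-in for a real JSON parser) illustrates how a pretty-printed document falls apart into fragments when it is split by line.

```java
import java.util.Arrays;
import java.util.List;

public class LineRecordDemo {

    // Mimics line-oriented reading: every line of the input becomes one record.
    static List<String> splitIntoRecords(String fileContent) {
        return Arrays.asList(fileContent.split("\n"));
    }

    // Rough check, NOT a JSON parser: the record must start with '{', end with '}',
    // and have balanced braces outside of string literals.
    static boolean looksLikeCompleteObject(String record) {
        String s = record.trim();
        if (!s.startsWith("{") || !s.endsWith("}")) {
            return false;
        }
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                 // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    return false;
                }
            }
        }
        return depth == 0 && !inString;
    }

    // Counts how many line records look like whole JSON objects.
    static long countComplete(String fileContent) {
        return splitIntoRecords(fileContent).stream()
                .filter(LineRecordDemo::looksLikeCompleteObject)
                .count();
    }

    public static void main(String[] args) {
        // Shortened, hypothetical variants of the sample document
        String singleLine = "{\"pr\":\"ALERT\",\"nea\":{\"hex\":[{\"si\":\"945d1\"}]}}";
        String multiLine = "{\n  \"pr\": \"ALERT\",\n  \"ga\": \"09809\"\n}";

        System.out.println("single-line: records=" + splitIntoRecords(singleLine).size()
                + ", complete=" + countComplete(singleLine));
        System.out.println("multi-line:  records=" + splitIntoRecords(multiLine).size()
                + ", complete=" + countComplete(multiLine));
    }
}
```

In this sketch the single-line input yields one record that passes the check, while the four-line input yields four records and none of them passes. Running a similar check against the actual files landing in the HDFS directory would show whether they contain embedded line breaks.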

0 answers:

No answers