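I want to read a JSON file from an HDFS directory with a Spark Streaming job. I am using the fileStream method provided by JavaStreamingContext, with TextInputFormat because my JSON file will be a single-line string. With TextInputFormat, I expect to read each value as a string and then create a Dataset with the read().json() method.
Below is the code I am trying to get working.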
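import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf config = new SparkConf().setAppName("HDFS Streaming Job");
// 10-second batch interval
JavaStreamingContext jsc = new JavaStreamingContext(config, new Duration(10000));

// args[0] is the HDFS directory to monitor; each record is (byte offset, line text)
JavaPairInputDStream<LongWritable, Text> fileLines =
    jsc.fileStream(args[0], LongWritable.class, Text.class, TextInputFormat.class);

// Keep only the line text
JavaDStream<String> dstream = fileLines.map(line -> line._2().toString());

dstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> rdd) {
        // Note: rowRDD is built here but never used below
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Row call(String msg) {
                return RowFactory.create(msg);
            }
        });
        // JavaSparkSessionSingleton is my own helper class that returns a shared SparkSession
        SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
        Dataset<Row> jsonDataset = spark.read().json(rdd);
        jsonDataset.show();
    }
});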
When I try to read the JSON, the resulting Dataset contains only a _corrupt_record column.
Log output:
++
||
++
++
++
||
++
++
++
||
++
++
+--------------------+
| _corrupt_record|
+--------------------+
|T", "generatedAt...|
|_Contract", "gen...|
|Code": "SUCCESS",...|
+--------------------+
Sample JSON:
{"pr":"ALERT","ga":"09809","ci":"NMIX","wfid":"WF","spi":"S_01","ssi":"","pi":"01","mi":"01","nea":{"hex":[{"si":"945d1","ni":"01","hi":"01","pi":"K9","et":"MODULE","hei":8798,"bn":""}],"fn":[{"si":"","ni":"","hi":"HH_001","pt":"K9","st":"","fni":"","fnn":"","mc":"","mp":"","dc":""}],"sx":[{"sni":"5d1","ni":"","hi":"","pi":"K9","et":"","st":"","sv":"10","si":"","bh":""}],"ad":[{"sni":"945d1","ni":"01","hi":"001","pi":"K9","st":"","pt":"","sv":"10","pui":" ","hn":"ah","mc":"HIGH","mr":"","pri":""}]}}
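
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: TextInputFormat emits one record per line, so if the file that lands in HDFS is not actually a single line, each fragment would be handed to the JSON reader on its own, which could produce values like the `T", "generatedAt...` rows in the log output. The sketch below is plain Java with no Spark dependency. `JsonLineCheck` and its methods are hypothetical names introduced only for illustration, and the bracket counting is a rough heuristic, not a JSON validator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark or Hadoop): mimics line-oriented reading
// and flags lines that do not look like one complete JSON object.
final class JsonLineCheck {

    // Heuristic only: true if the trimmed line starts with '{', ends with '}',
    // and brackets outside string literals balance exactly at the last character.
    static boolean looksLikeCompleteJsonObject(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.charAt(0) != '{' || s.charAt(s.length() - 1) != '}') {
            return false;
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;          // character after a backslash is skipped
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                // Closing too early or too often means this is not a single object
                if (depth < 0 || (depth == 0 && i < s.length() - 1)) {
                    return false;
                }
            }
        }
        return depth == 0 && !inString;
    }

    // Splits the content on line breaks, as a line-oriented reader would,
    // and returns the non-blank lines that fail the heuristic above.
    static List<String> incompleteLines(String fileContent) {
        List<String> bad = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            if (!line.trim().isEmpty() && !looksLikeCompleteJsonObject(line)) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String singleLine = "{\"pr\":\"ALERT\",\"nea\":{\"hex\":[{\"hei\":8798}]}}";
        String multiLine = "{\"pr\":\"ALERT\",\n\"ga\":\"09809\"}";
        System.out.println(incompleteLines(singleLine).size()); // prints 0
        System.out.println(incompleteLines(multiLine).size());  // prints 2
    }
}
```

Running the same check against the raw contents of a file from the monitored directory would show whether every line is a complete object. If some lines are flagged, the input is probably not one JSON document per line as assumed.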