Converting a complex nested JSON to a Spark DataFrame in Java

Asked: 2018-05-21 19:02:31

Tags: java json apache-spark dataframe nested

Can anyone help convert the following JSON into a Spark DataFrame using Java code?

Note: it is not a file.

Logic: listen to Kafka topic T1, read each record in the RDD, apply additional logic to turn the resulting data into a JSON object, and write it to another Kafka topic, T2.
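A minimal sketch of that pipeline shape (not the actual job) is below, assuming the spark-streaming-kafka-0-10 integration; the broker address, batch interval, consumer group, and the producer side writing to T2 are illustrative placeholders:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class T1ToT2Sketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("t1-to-t2");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");     // assumed broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "alarm-updates");                // assumed consumer group

            // Subscribe to topic T1 and get a stream of ConsumerRecords
            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Arrays.asList("T1"), kafkaParams));

            stream.foreachRDD(rdd -> rdd.foreach(record -> {
                String resultJson = record.value();   // apply the additional per-record logic here
                // write resultJson to topic T2 with a KafkaProducer (producer setup omitted)
            }));

            jssc.start();
            jssc.awaitTermination();
        }
    }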

The structure of T2 is shown below.

JSON:

 [  
   {  
      "@tenant_id":"XYZ",
      "alarmUpdateTime":1526342400000,
      "alarm_id":"AB5C9123",
      "alarm_updates":[  
         {  
            "alarmField":"Severity",
            "new_value":"Minor",
            "old_value":"Major"
         },
         {  
            "alarmField":"state",
            "new_value":"UPDATE",
            "old_value":"NEW"
         }
      ],
      "aucID":"5af83",
      "inID":"INC15234567",
      "index":"test",
      "product":"test",
      "source":"ABS",
      "state":"NEW"
   }
]

Classes created:

    class Alarm {

        String tenant_id;   // the JSON key is "@tenant_id"; "@" is not allowed in a Java identifier
        String alarm_id;
        .
        .
        List<AlarmUpdate> update;

        // getters and setters for all fields
    }

class AlarmUpdate {

    String alarmField;
    String oldVal;
    String newVal;

    // getters and setters for all fields
}
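As a concrete illustration of the "getters and setters" mentioned above, a sketch of the AlarmUpdate bean written out in full might look like this (the Serializable marker and accessor names are assumptions, not the original code):

    import java.io.Serializable;

    // Illustrative sketch of the AlarmUpdate bean described above.
    public class AlarmUpdate implements Serializable {
        private String alarmField;
        private String oldVal;
        private String newVal;

        public String getAlarmField() { return alarmField; }
        public void setAlarmField(String alarmField) { this.alarmField = alarmField; }

        public String getOldVal() { return oldVal; }
        public void setOldVal(String oldVal) { this.oldVal = oldVal; }

        public String getNewVal() { return newVal; }
        public void setNewVal(String newVal) { this.newVal = newVal; }
    }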

class AppClass {

    public static void main(String[] args) {
        Alarm alarmObj = new Alarm();
        // set values for the fields in alarmObj
        Dataset<Row> results = jobCtx.getSparkSession()
                .createDataFrame(Arrays.asList(alarmObj), Alarm.class);

        // At this point I see the following errors.
    }
}

Error:

  

    2018-05-15 13:40:48 ERROR JobScheduler - Error running job streaming job 1526406040000 ms.0
    scala.MatchError: com.ca.alarmupdates.AlarmUpdate@48c8809b (of class com.ca.alarmupdates.AlarmUpdate)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:236) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:170) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:154) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) ~[spark-catalyst_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1105) ~[spark-sql_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1105) ~[spark-sql_2.11-2.2.0.jar:2.2.0]
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ~[jaf-sdk-2.4.0.jar:?]
        at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1105) ~[spark-sql_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1103) ~[spark-sql_2.11-2.2.0.jar:2.2.0]
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.Iterator$class.toStream(Iterator.scala:1322) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.AbstractIterator.toStream(Iterator.scala:1336) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) ~[jaf-sdk-2.4.0.jar:?]
        at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336) ~[jaf-sdk-2.4.0.jar:?]
        at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:406) ~[spark-sql_2.11-2.2.0.jar:2.2.0]
        at com.ca.alarmupdates.AlarmUpdates.lambda$null$0(AlarmUpdates.java:85) ~[classes/:?]
        at java.util.Arrays$ArrayList.forEach(Arrays.java:3880) ~[?:1.8.0_161]
        at com.ca.alarmupdates.AlarmUpdates.lambda$main$f87f782d$1(AlarmUpdates.java:58) ~[classes/:?]
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at scala.util.Try$.apply(Try.scala:192) ~[jaf-sdk-2.4.0.jar:?]
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) ~[jaf-sdk-2.4.0.jar:?]
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256) ~[spark-streaming_2.11-2.2.0.jar:2.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_161]

1 answer:

Answer 0 (score: 0):

You can use wholeTextFiles to read the JSON file, get the JSON text, and pass it to the json API of SparkSession.

You should have:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static SparkSession spark = SparkSession.builder().master("local").appName("simple").getOrCreate();
static JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

Dataset<Row> df = spark.read().json(sc.wholeTextFiles("path to json file").map(t -> t._2()));
df.show(false);

which should give you the following output:

+----------+---------------+--------+--------------------------------------------+-----+-----------+-----+-------+------+-----+
|@tenant_id|alarmUpdateTime|alarm_id|alarm_updates                               |aucID|inID       |index|product|source|state|
+----------+---------------+--------+--------------------------------------------+-----+-----------+-----+-------+------+-----+
|XYZ       |1526342400000  |AB5C9123|[[Severity,Minor,Major], [state,UPDATE,NEW]]|5af83|INC15234567|test |test   |ABS   |NEW  |
+----------+---------------+--------+--------------------------------------------+-----+-----------+-----+-------+------+-----+

You can set master and appName as you require.
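If you also want to see how the nested alarm_updates field was inferred, one quick check (an optional addition, reusing the df from the snippet above) is to print the schema:

// Optional: inspect the inferred schema; alarm_updates should appear as an array of structs.
df.printSchema();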

Update:

You commented:

    The way you do through file, can we do it with the object. I have to convert to ingest the data to the other T2.

For that, suppose you have the record read from topic T1 as a String object:

String t1Record = "[\n" +
        "  {\n" +
        "    \"@tenant_id\":\"XYZ\",\n" +
        "    \"alarmUpdateTime\":1526342400000,\n" +
        "    \"alarm_id\":\"AB5C9123\",\n" +
        "    \"alarm_updates\":[\n" +
        "      {\n" +
        "        \"alarmField\":\"Severity\",\n" +
        "        \"new_value\":\"Minor\",\n" +
        "        \"old_value\":\"Major\"\n" +
        "      },\n" +
        "      {\n" +
        "        \"alarmField\":\"state\",\n" +
        "        \"new_value\":\"UPDATE\",\n" +
        "        \"old_value\":\"NEW\"\n" +
        "      }\n" +
        "    ],\n" +
        "    \"aucID\":\"5af83\",\n" +
        "    \"inID\":\"INC15234567\",\n" +
        "    \"index\":\"test\",\n" +
        "    \"product\":\"test\",\n" +
        "    \"source\":\"ABS\",\n" +
        "    \"state\":\"NEW\"\n" +
        "  }\n" +
        "]";

and you convert it to an RDD:

JavaRDD<String> t1RecordRDD = sc.parallelize(Arrays.asList(t1Record));

Then you can apply the json API to convert it to a dataframe.
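A minimal sketch of that step, reusing the spark session and the t1RecordRDD defined above (the exact call is an assumption based on the earlier snippet):

// Parse the single-record RDD with the same json reader used for the file-based example.
Dataset<Row> t2Df = spark.read().json(t1RecordRDD);
t2Df.show(false);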

This should give you the same result as above.