Parsing a CSV file with JSON arrays using Scala

Asked: 2018-10-11 21:00:42

Tags: json scala csv apache-spark databricks

I have a CSV file that is a real pain to parse. One of its columns contains double quotes and embedded commas, and other columns contain JSON. Example:

+---------+------------------+-------------------------------------------------------------+------------------------------------------------------------
| column1 | column2          | jsonColumn1                                                 | jsonColumn2
+---------+------------------+-------------------------------------------------------------+------------------------------------------------------------
| 201     | "1", "ABC", "92" | [{ "Key1": 200,"Value1": 21 }, {"Key2": 200, "Value2" : 4}] | [{"date":"9999-09-26T08:50:06Z","fakenumber":"1-877-488-2364-","fakedata":"4.20","fakedata2":"102332.06"}]
+---------+------------------+-------------------------------------------------------------+------------------------------------------------------------
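
For context, I assume a raw data row in the file looks roughly like this, with each field wrapped in double quotes and any embedded quotes doubled (I have not confirmed exactly how the file escapes them):

201,"""1"", ""ABC"", ""92""","[{ ""Key1"": 200,""Value1"": 21 }, {""Key2"": 200, ""Value2"" : 4}]","[{""date"":""9999-09-26T08:50:06Z"",""fakenumber"":""1-877-488-2364-"",""fakedata"":""4.20"",""fakedata2"":""102332.06""}]"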

I need to extract this using Scala. How can I get it to ignore the commas inside column2 and, for each row, append selected key/value pairs as new columns? I want it to look like this:

+---------+------------------+----------------------+----------------------+----------------+----------------------+
| column1 | column2          | jsonColumn1          | jsonColumn2          | jsonColumn1Key | jsonColumnDate       |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
| 201     | "1", "ABC", "92" | Keep Original Record | keep original record | 200            | 9999-09-26T08:50:06Z |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
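
Something along these lines is roughly what I have in mind for producing the two extra columns (completely untested; df is just a placeholder name for the loaded DataFrame, and I am assuming get_json_object accepts a $[0] path to address the first element of a JSON array):

import org.apache.spark.sql.functions.{col, get_json_object}

// df is a placeholder for the DataFrame loaded from the CSV.
// get_json_object pulls a single value out of a JSON string by path,
// and withColumn keeps every original column while appending the new ones.
val withExtras = df
  .withColumn("jsonColumn1Key", get_json_object(col("jsonColumn1"), "$[0].Key1"))
  .withColumn("jsonColumnDate", get_json_object(col("jsonColumn2"), "$[0].date"))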

What I have done so far is import the data and create a schema (before parsing), and then use StructField to add a new schema for the columns that contain the inner JSON.

import org.apache.spark.sql.types._

val csvSchema = new StructType()
  .add("column1", StringType, true)
  .add("column2", StringType, true)
  .add("jsonColumn1", StringType, true)
  .add("jsonColumn2", StringType, true)

The first problem I ran into is column2. How do I deal with it? For parsing the JSON inside the CSV, I plan to model my solution on this one: split JSON value from CSV file and create new column based on json key in Spark/Scala

EDIT

val csvfile = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .load("file.csv")

display(csvfile)
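
If I should be using the explicit csvSchema from above instead of inferSchema, I assume the read would change to something like this (untested; csvfileWithSchema is just a name I made up):

val csvfileWithSchema = sqlContext.read.format("csv")
  .option("header", "true")
  .option("quote", "\"")   // fields are wrapped in double quotes
  .option("escape", "\"")  // embedded quotes are escaped by doubling them
  .schema(csvSchema)       // explicit schema instead of inferSchema
  .load("file.csv")

display(csvfileWithSchema)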

0 Answers:

There are no answers yet.