I have a CSV file that is genuinely awkward to parse. One column contains double quotes and commas inside the value, and two other columns contain JSON. Example:
+---------+------------------+---------------------------------+----------------------------------+
| column1 | column2          | jsonColumn1                     | jsonColumn2                      |
+---------+------------------+---------------------------------+----------------------------------+
| 201     | "1", "ABC", "92" | [{ "Key1": 200, "Value1": 21 }, | [{"date":"9999-09-26T08:50:06Z", |
|         |                  |  { "Key2": 200, "Value2": 4 }]  |  "fakenumber":"1-877-488-2364-", |
|         |                  |                                 |  "fakedata":"4.20",              |
|         |                  |                                 |  "fakedata2":"102332.06"}]       |
+---------+------------------+---------------------------------+----------------------------------+
I need to extract this with Scala. How can I make it ignore the commas inside column2, and, for each row, append a chosen key-value pair from the JSON as a new column? I want it to look like this:
+---------+------------------+----------------------+----------------------+----------------+----------------------+
| column1 | column2          | jsonColumn1          | jsonColumn2          | jsonColumn1Key | jsonColumnDate       |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
| 201     | "1", "ABC", "92" | keep original record | keep original record | 200            | 9999-09-26T08:50:06Z |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
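My rough idea for the two extra columns is a sketch (untested) using Spark's built-in get_json_object, which takes a JSON path string, so it would not need an extra schema. It assumes the csvfile DataFrame from the EDIT further down, and that the key and date always sit in the first element of each JSON array:

import org.apache.spark.sql.functions.{col, get_json_object}

// Sketch: csvfile is the DataFrame read in the EDIT below (assumption).
// "$[0].Key1" / "$[0].date" address the first element of each JSON array;
// get_json_object returns the matched value as a string column.
val withExtras = csvfile
  .withColumn("jsonColumn1Key", get_json_object(col("jsonColumn1"), "$[0].Key1"))
  .withColumn("jsonColumnDate", get_json_object(col("jsonColumn2"), "$[0].date"))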
What I have done so far: import the data and create a schema (before parsing), then use StructField to attach a schema to the columns that contain the inner JSON.
import org.apache.spark.sql.types._

val csvSchema = new StructType()
  .add("column1", StringType, true)
  .add("column2", StringType, true)
  .add("jsonColumn1", StringType, true)
  .add("jsonColumn2", StringType, true)
The first problem I am running into is column2. How do I get around it? For parsing the JSON inside the CSV, I was going to imitate a solution along the lines of: split JSON value from CSV file and create new column based on json key in Spark/Scala
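In the spirit of that answer, a from_json based sketch (assuming Spark 2.2+, where from_json accepts an ArrayType, plus the hypothetical jsonColumn1Schema / jsonColumn2Schema above and the csvfile DataFrame from the EDIT below) might look like:

import org.apache.spark.sql.functions.{col, from_json}

// Parse each JSON string column against its schema, pull one field out of
// the first array element, then drop the intermediate parsed columns so the
// original records are kept untouched.
val parsed = csvfile
  .withColumn("json1", from_json(col("jsonColumn1"), jsonColumn1Schema))
  .withColumn("json2", from_json(col("jsonColumn2"), jsonColumn2Schema))
  .withColumn("jsonColumn1Key", col("json1").getItem(0).getField("Key1"))
  .withColumn("jsonColumnDate", col("json2").getItem(0).getField("date"))
  .drop("json1", "json2")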
EDIT
val csvfile = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .load("file.csv")

display(csvfile)
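As far as I understand, the quote/escape pair is what should fix column2: with escape set to ", Spark treats the doubled quotes inside a quoted field as literal quotes, so the embedded commas no longer split the column. If that read works, swapping the inference for the explicit schema and chaining the path-based extraction would give an end-to-end sketch (untested):

import org.apache.spark.sql.functions.{col, get_json_object}

// Hypothetical end-to-end: explicit csvSchema instead of inferSchema,
// then the get_json_object extraction from the sketch further up.
val result = sqlContext.read.format("csv")
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .schema(csvSchema)
  .load("file.csv")
  .withColumn("jsonColumn1Key", get_json_object(col("jsonColumn1"), "$[0].Key1"))
  .withColumn("jsonColumnDate", get_json_object(col("jsonColumn2"), "$[0].date"))

display(result)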