Spark Dataset from an inner array of JSON

Time: 2017-08-10 08:54:29

Tags: json scala apache-spark apache-spark-sql apache-spark-dataset

I am trying to read JSON into a Dataset (Spark 2.1.1). Unfortunately it does not work and fails with an error.
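
A minimal sketch of the kind of setup that hits this (the JSON sample, the RDD-based input, and the object wrapper are my assumptions; the case classes mirror the ones in the answer below):

import org.apache.spark.sql.SparkSession

object ReadOwners {
  // Case classes mirroring the answer below; note age is a bare Long here.
  case class Owner(id: String, pets: Seq[Pet])
  case class Pet(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-ds").getOrCreate()
    import spark.implicits._

    // Hypothetical input: the second pet has no "age" field.
    val input = spark.sparkContext.parallelize(Seq(
      """{"id": "o1", "pets": [{"name": "Rex", "age": 3}, {"name": "Fido"}]}"""))

    // On Spark 2.1.1 this conversion fails; see the answer below for workarounds.
    val owners = spark.read.json(input).as[Owner]
    owners.show()
  }
}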

Any idea what I am doing wrong?

1 answer:

Answer 0 (score: 2):

Generally speaking, if a field can be missing, use an Option:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(name: String, age: Option[Long]) // a missing "age" decodes to None
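
With Option, a pet whose age is missing simply decodes to None. A small usage sketch (owners.json is a hypothetical file; spark and its implicits import are assumed to be in scope):

val owners = spark.read.json("owners.json").as[Owner]
// Pets whose "age" was absent in the JSON come back with age = None.
owners.flatMap(_.pets).filter(_.age.isEmpty).show()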

Or use a nullable (boxed) type, which, unlike Scala's Long, can hold null:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(name: String, age: java.lang.Long) // boxed Long can be null

But this one really does look like a bug. I tested it on Spark 2.2, where it is fixed. I think a quick workaround is to keep the fields sorted by name:

case class Owner(id: String, pets: Seq[Pet])
case class Pet(age: java.lang.Long, name: String) // fields declared in alphabetical order
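
My reading of why sorting helps (the answer does not spell this out): Spark's JSON schema inference orders struct fields alphabetically, so declaring the case-class fields in the same alphabetical order keeps the encoder and the inferred schema aligned positionally. The inferred order is easy to inspect:

spark.read.json("owners.json").printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- pets: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- age: long (nullable = true)
//  |    |    |-- name: string (nullable = true)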