Question

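I want to access different fields and subfields of a deeply nested structure that contains arrays, so that I can do arithmetic on them. Some of the data is actually in the field names themselves (the structure I have to access was created that way, and there is nothing I can do about it). In particular, I have a list of numbers that I must use as field names. They change from one JSON file to the next, so I have to dynamically infer what those field names are and then use them together with the subfield values.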

I have already looked at this: Access names of fields in struct Spark SQL. Unfortunately, I don't know in advance what the struct's field names will be, so I can't use it.

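I also tried this, which looked promising: how to extract the column name and data type from nested struct type in spark. Unfortunately, whatever magic the "flatten" function does, I could not adapt it to return the field names rather than the fields themselves.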

Here is a sample JSON dataset. It represents consumption baskets:

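Each of the two baskets, "comp A" and "comp B", has several prices as subfields: compA.'55.80' is one price, compA.'132.88' is another, and so on.
I want to associate these unit prices with the quantities available in their respective subfields: compA.'55.80'[0].comment.qty (500) and compA.'55.80'[1].comment.qty (600) should both be associated with 55.80; compA.'132.88'[0].comment.qty (700) should be associated with 132.88; and so on.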

{"type":"test","name":"john doe","products":{
    "baskets":{
        "comp A":{
            "55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]
            ,"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
        }
        ,"comp B":{
            "55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}]
            ,"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]
        }
    }
}}

I would like to get all these numbers into a dataframe so that I can do some operations on them:

+ -------+---------+----------+
+ basket | price   | quantity +
+ -------+---------+----------+
+ comp A | 55.80   | 500      +
+ comp A | 55.80   | 600      +
+ comp A | 132.88  | 700      +
+ comp A | 0.03    | 500      +
+ comp A | 0.03    | 600      +
+ comp B | 55.70   | 500      +
+ comp B | 55.70   | 600      +
+ comp B | 132.98  | 300      +
+ comp B | 132.98  | 900      +
+ comp B | 0.01    | 400      +
+ -------+---------+----------+

The original dataset is accessed as follows:

scala> myDs
res135: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [products: struct<baskets: struct<compA: struct<55.80: array<struct .....
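
For reference, the field names that Spark inferred are schema metadata, so they can be listed on the driver without knowing them in advance. The following is a minimal sketch, not part of the original post: `childNames` is a hypothetical helper, and the schema in `main` is a hand-built stand-in for a fragment of the inferred one.

```scala
import org.apache.spark.sql.types._

object FieldNames {
  // Hypothetical helper: walk down nested struct fields by exact name
  // and return the names of the children found at the end of the path.
  def childNames(schema: StructType, path: String*): Seq[String] =
    path.foldLeft(schema) { (struct, name) =>
      struct(name).dataType match {
        case s: StructType => s
        case other =>
          throw new IllegalArgumentException(s"$name is not a struct: $other")
      }
    }.fieldNames.toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: a reduced, hand-built copy of the inferred schema.
    val items = ArrayType(new StructType().add("type", StringType))
    val schema = new StructType().add("products", new StructType().add("baskets",
      new StructType()
        .add("comp A", new StructType().add("55.80", items).add("132.88", items))
        .add("comp B", new StructType().add("0.01", items))))

    println(childNames(schema, "products", "baskets"))           // basket names
    println(childNames(schema, "products", "baskets", "comp A")) // price strings
  }
}
```

Against the real dataset the call would presumably be `childNames(myDs.schema, "products", "baskets")`. This only recovers the names; pairing them with the quantities still requires building one column expression per discovered name.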

Answer 1

Processing data that arrives as column names is not an approach to follow. It simply won't work.
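
One possible workaround, if the JSON can be re-read instead of relying on schema inference, is to declare `baskets` as a map in an explicit schema, so that basket names and prices become map keys (data) rather than struct field names. The sketch below is an illustration under that assumption, not a method given in the original answer: the object name, the `flatten` helper, the reduced sample record, and the `LongType`/`DecimalType(10, 2)` choices are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

object BasketsAsMaps {
  // One array element: {"type": ..., "comment": {"qty": ..., "text": ...}}
  val item: StructType = new StructType()
    .add("type", StringType)
    .add("comment", new StructType().add("qty", LongType).add("text", StringType))

  // baskets is declared as: basket name -> (price string -> array of items)
  val schema: StructType = new StructType()
    .add("type", StringType)
    .add("name", StringType)
    .add("products", new StructType()
      .add("baskets", MapType(StringType, MapType(StringType, ArrayType(item)))))

  // Explode both map levels, then the array, yielding one row per quantity.
  def flatten(raw: DataFrame): DataFrame =
    raw
      .select(explode(col("products.baskets")).as(Seq("basket", "prices")))
      .select(col("basket"), explode(col("prices")).as(Seq("price", "items")))
      .select(col("basket"),
              col("price").cast(DecimalType(10, 2)).as("price"),
              explode(col("items")).as("item"))
      .select(col("basket"), col("price"), col("item.comment.qty").as("quantity"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("baskets-as-maps").getOrCreate()
    import spark.implicits._

    // Reduced copy of the sample record from the question, as a single JSON string.
    val sample = Seq(
      """{"type":"test","name":"john doe","products":{"baskets":{""" +
      """"comp A":{"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},""" +
      """{"type":"work","comment":{"qty":600,"text":"hello"}}],""" +
      """"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]},""" +
      """"comp B":{"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]}}}}"""
    ).toDS()

    val raw = spark.read.schema(schema).json(sample)
    flatten(raw).show(false)
    spark.stop()
  }
}
```

Because the prices are map keys here, nothing in the code depends on which prices a given file contains. Reading from files instead of an in-memory string would mean replacing `json(sample)` with a path, plus the `multiLine` option if each record spans several lines as in the question.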
