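Remove nested array entries in a DataFrame (JSON) based on a condition

Time: 2017-11-16 17:44:18

Tags: scala apache-spark dataframe rdd

I read a huge file into a DataFrame; each line of the file contains a JSON object like the following: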

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

How can I manipulate the DataFrame so that only the module with id = "Default" is kept? How do I remove all the other entries, whose id is not equal to "Default"?

1 Answer:

Answer 0: (score: 1)

As you said, each line of your file contains JSON in the following format:

{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}

If that is the case, you can use sqlContext's read.json API to read the JSON file into a dataframe, as below:

val df = sqlContext.read.json("path to json file")

This should give you a dataframe like:

+--------------------------------------------------------------------+------+--------------+
|modules                                                             |userId|vars          |
+--------------------------------------------------------------------+------+--------------+
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
+--------------------------------------------------------------------+------+--------------+
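For a quick local check without a file on disk, the same read can be sketched against in-memory lines. This is only a sketch, assuming Spark 2.2 or later (where `read.json` accepts a `Dataset[String]`) and a local `SparkSession` instead of the `sqlContext` used in the answer; the object name `ReadJsonLines` and the helper `fromLines` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonLines {
  // Hypothetical helper: parse in-memory JSON lines the same way
  // read.json parses a file with one JSON object per line.
  def fromLines(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._ // provides the String encoder and toDS()
    spark.read.json(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .appName("read-json-lines")
      .master("local[*]")
      .getOrCreate()

    // Shortened sample record (two modules instead of six).
    val line =
      """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"}]}"""

    val df = fromLines(spark, Seq(line, line))
    df.printSchema()             // modules should be inferred as an array of structs
    df.show(truncate = false)

    spark.stop()
  }
}
```

Schema inference reads the data once before building the dataframe, so for a truly huge file it may be worth passing an explicit schema to the reader.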

The schema is:

root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- vars: struct (nullable = true)
 |    |-- brand: string (nullable = true)
 |    |-- test_group: string (nullable = true)

The final step is to explode modules.id and filter for rows with Default as the value:

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._  // enables the $"..." column syntax

val finaldf = df.withColumn("modules", explode($"modules.id"))
    .filter($"modules" === "Default")

This should give you:

+-------+------+--------------+
|modules|userId|vars          |
+-------+------+--------------+
|Default|12345 |[xband,group1]|
|Default|12345 |[xband,group1]|
+-------+------+--------------+

I hope the answer is helpful.

Update

This would create JSON as:

{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

But if your requirement is output like the following:

{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

then you should explode modules instead of modules.id.
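A minimal sketch of that variant, exploding the whole modules struct and then filtering on its id field. It assumes Spark 2.2 or later and a local `SparkSession` rather than the `sqlContext` used above; the object name `KeepDefaultModule` and the method `keepDefault` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object KeepDefaultModule {
  // Explode the array of structs so each module gets its own row,
  // then keep only the rows whose struct has id == "Default".
  def keepDefault(df: DataFrame): DataFrame =
    df.withColumn("modules", explode(col("modules")))
      .filter(col("modules.id") === "Default")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .appName("keep-default-module")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Shortened sample record (three modules instead of six).
    val line =
      """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"Rating"}]}"""
    val df = spark.read.json(Seq(line, line).toDS())

    // modules is now a struct with a single id field, not a plain string.
    keepDefault(df).toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

Note that explode emits no row for a record whose modules array is null or empty, and the filter then drops every record that has no "Default" module, so such users disappear from the result. A record with several "Default" entries yields one row per entry.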