How do I convert an array of JSON strings into a Dataset with specific columns in Spark 2.2.0?

Asked: 2017-10-07 06:50:05

Tags: apache-spark

I have a Dataset<String> ds that consists of JSON rows.

Example JSON row (this is just one sample row from the dataset):

[
    "{\"name\": \"foo\", \"address\": {\"state\": \"CA\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"english\", \"year\": 2016}]}",
    "{\"name\": \"bar\", \"address\": {\"state\": \"OH\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"math\", \"year\": 2017}]}"
]
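Note that each element of the array is itself a string containing JSON, so for the row to parse, the quotes inside each element must be backslash-escaped. Reproducing such a row as a Java string literal takes a second level of escaping; a minimal sketch (the shortened field set is just for illustration):

```java
public class EscapedJsonRowDemo {
    // Builds a JSON array whose elements are strings containing JSON.
    // In Java source, each inner JSON quote needs "\\\"" (escaped
    // backslash plus escaped quote) so the resulting string holds the
    // two characters \" that the JSON grammar requires.
    static String buildRow() {
        return "[\"{\\\"name\\\": \\\"foo\\\"}\", \"{\\\"name\\\": \\\"bar\\\"}\"]";
    }

    public static void main(String[] args) {
        System.out.println(buildRow());
        // prints: ["{\"name\": \"foo\"}", "{\"name\": \"bar\"}"]
    }
}
```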

ds.printSchema()

root
 |-- value: string (nullable = true)

Now, using Spark 2.2.0, I want to convert it into the following dataset:

name  |             address               |  docs 
----------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 2017}]

Java preferred, as long as the functions are available in the Java API, but Scala is fine too.

This is what I have tried so far:

val df = Seq("""["{\"name\": \"foo\", \"address\": {\"state\": \"CA\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"english\", \"year\": 2016}]}", "{\"name\": \"bar\", \"address\": {\"state\": \"OH\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"math\", \"year\": 2017}]}"]""").toDF

df.show(false)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                           |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["{\"name\": \"foo\", \"address\": {\"state\": \"CA\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"english\", \"year\": 2016}]}", "{\"name\": \"bar\", \"address\": {\"state\": \"OH\", \"country\": \"USA\"}, \"docs\":[{\"subject\": \"math\", \"year\": 2017}]}"]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 Answer:

Answer 0 (score: 1):

I found a workaround in Java. I hope this helps.

Create a bean class (TempBean in my case):

import java.util.List;
import java.util.Map;

public class TempBean {
    String name;
    Map<String, String> address;
    List<Map<String, String>> docs;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Map<String, String> getAddress() {
        return address;
    }

    public void setAddress(Map<String, String> address) {
        this.address = address;
    }

    public List<Map<String, String>> getDocs() {
        return docs;
    }

    public void setDocs(List<Map<String, String>> docs) {
        this.docs = docs;
    }
}

Use the following code, together with these imports:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
List<String> dfList = ds.collectAsList(); // using your Dataset<String>
List<TempBean> tempList = new ArrayList<TempBean>();
try {
    for (String json : dfList) {
        List<Map<String, Object>> mapList =
            mapper.readValue(json, new TypeReference<List<Map<String, Object>>>() {});
        for (Map<String, Object> map : mapList) {
            TempBean temp = new TempBean();
            temp.setName(map.get("name").toString());
            temp.setAddress((Map<String, String>) map.get("address"));
            temp.setDocs((List<Map<String, String>>) map.get("docs"));
            tempList.add(temp);
        }
    }
} catch (JsonParseException e) {
    e.printStackTrace();
} catch (JsonMappingException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
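The reshaping done by the inner loop can be exercised without Spark or Jackson by starting from an already-parsed List<Map<String, Object>>. A self-contained sketch, where the hand-built maps stand in for mapper.readValue's output (note that Jackson would actually produce an Integer for "year"; a string is used here to match the Map<String, String> cast):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReshapeDemo {
    // Minimal stand-in for the TempBean class above.
    public static class TempBean {
        public String name;
        public Map<String, String> address;
        public List<Map<String, String>> docs;
    }

    // Same per-element reshaping as the answer's inner loop.
    @SuppressWarnings("unchecked")
    public static List<TempBean> reshape(List<Map<String, Object>> mapList) {
        List<TempBean> out = new ArrayList<>();
        for (Map<String, Object> map : mapList) {
            TempBean temp = new TempBean();
            temp.name = map.get("name").toString();
            temp.address = (Map<String, String>) map.get("address"); // unchecked cast, as in the answer
            temp.docs = (List<Map<String, String>>) map.get("docs"); // unchecked cast, as in the answer
            out.add(temp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-built equivalent of one parsed JSON element.
        Map<String, String> address = new HashMap<>();
        address.put("state", "CA");
        address.put("country", "USA");
        Map<String, String> doc = new HashMap<>();
        doc.put("subject", "english");
        doc.put("year", "2016");
        Map<String, Object> row = new HashMap<>();
        row.put("name", "foo");
        row.put("address", address);
        row.put("docs", Arrays.asList(doc));

        List<TempBean> beans = reshape(Arrays.asList(row));
        System.out.println(beans.get(0).name + " " + beans.get(0).address.get("state"));
        // prints: foo CA
    }
}
```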

Create the DataFrame:

Dataset<Row> dff = spark.createDataFrame(tempList, TempBean.class);
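createDataFrame(List, Class) infers the schema through JavaBean reflection, which is likely why the columns in the output below come out in alphabetical order (address, docs, name) rather than declaration order. The standard-library Introspector that JavaBean reflection is built on shows the same ordering; a sketch using a trimmed, getters-only copy of the bean (introspection only looks at the getters):

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.List;
import java.util.Map;

public class BeanOrderDemo {
    // Trimmed copy of the TempBean above, for illustration only.
    public static class TempBean {
        public String getName() { return null; }
        public Map<String, String> getAddress() { return null; }
        public List<Map<String, String>> getDocs() { return null; }
    }

    // Returns the bean's property names in the order the Introspector
    // reports them (alphabetical by property name).
    public static String propertyOrder() throws IntrospectionException {
        BeanInfo info = Introspector.getBeanInfo(TempBean.class, Object.class);
        StringBuilder sb = new StringBuilder();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (sb.length() > 0) sb.append(",");
            sb.append(pd.getName());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IntrospectionException {
        System.out.println(propertyOrder());
        // prints: address,docs,name
    }
}
```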

Show the DataFrame:

dff.show(false);
+--------------------------------+---------------------------------------+----+
|address                         |docs                                   |name|
+--------------------------------+---------------------------------------+----+
|Map(state -> CA, country -> USA)|[Map(subject -> english, year -> 2016)]|foo |
|Map(state -> OH, country -> USA)|[Map(subject -> math, year -> 2017)]   |bar |
+--------------------------------+---------------------------------------+----+

Print the schema:

dff.printSchema();
root
 |-- address: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- docs: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- name: string (nullable = true)