I have a Dataset&lt;String&gt; ds that consists of JSON rows.
Sample JSON row (this is just an example of one row in the dataset):
[
"{"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]}",
"{"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs":[{"subject": "math", "year": 2017}]}"
]
ds.printSchema()
root
|-- value: string (nullable = true)
Now I want to convert it to the following dataset using Spark 2.2.0:
name | address | docs
----------------------------------------------------------------------------------
"foo" | {"state": "CA", "country": "USA"} | [{"subject": "english", "year": 2016}]
"bar" | {"state": "OH", "country": "USA"} | [{"subject": "math", "year": 2017}]
Preferably Java, as long as the functions are available in the Java API, but Scala is also fine. This is what I have tried so far:
val df = Seq("""["{"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]}", "{"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs":[{"subject": "math", "year": 2017}]}" ]""").toDF
df.show(false)
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["{"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]}", "{"name": "bar", "address": {"state": "OH", "country": "USA"}, "docs":[{"subject": "math", "year": 2017}]}" ]|
Answer (score: 1)
I found a workaround in Java. I hope this helps.
Create a bean class (TempBean in my case):
import java.util.List;
import java.util.Map;

public class TempBean
{
    String name;
    Map<String, String> address;
    List<Map<String, String>> docs;

    public String getName()
    {
        return name;
    }

    public void setName(String name)
    {
        this.name = name;
    }

    public Map<String, String> getAddress()
    {
        return address;
    }

    public void setAddress(Map<String, String> address)
    {
        this.address = address;
    }

    public List<Map<String, String>> getDocs()
    {
        return docs;
    }

    public void setDocs(List<Map<String, String>> docs)
    {
        this.docs = docs;
    }
}
Use the following code with these imports:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.databind.ObjectMapper;
ObjectMapper mapper = new ObjectMapper();
List<String> dfList = ds.collectAsList(); // using your Dataset<String>
List<TempBean> tempList = new ArrayList<TempBean>();
try
{
    for (String json : dfList)
    {
        // each row is itself a JSON array of objects, so parse it as a list of maps
        List<Map<String, Object>> mapList = mapper.readValue(json, new TypeReference<List<Map<String, Object>>>(){});
        for (Map<String, Object> map : mapList)
        {
            TempBean temp = new TempBean();
            temp.setName(map.get("name").toString());
            temp.setAddress((Map<String, String>) map.get("address"));
            temp.setDocs((List<Map<String, String>>) map.get("docs"));
            tempList.add(temp);
        }
    }
}
catch (JsonParseException e)
{
    e.printStackTrace();
}
catch (JsonMappingException e)
{
    e.printStackTrace();
}
catch (IOException e)
{
    e.printStackTrace();
}
Create the DataFrame:
Dataset<Row> dff = spark.createDataFrame(tempList, TempBean.class);
Display the DataFrame:
dff.show(false);
+--------------------------------+---------------------------------------+----+
|address |docs |name|
+--------------------------------+---------------------------------------+----+
|Map(state -> CA, country -> USA)|[Map(subject -> english, year -> 2016)]|foo |
|Map(state -> OH, country -> USA)|[Map(subject -> math, year -> 2017)] |bar |
+--------------------------------+---------------------------------------+----+
Print the schema:
dff.printSchema();
root
|-- address: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- docs: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- name: string (nullable = true)
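As a side note (my addition, not part of the original answer): the workaround above collects the whole dataset to the driver. If each row of ds holds a single JSON object (rather than a JSON array), Spark 2.2's `DataFrameReader.json` accepts a `Dataset<String>` directly and infers the nested schema on the executors. A minimal sketch, assuming a local SparkSession and one JSON object per row:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// A local session stands in for whatever SparkSession the question uses.
SparkSession spark = SparkSession.builder()
        .master("local[1]").appName("json-rows").getOrCreate();

// Stand-in for the question's Dataset<String>: one JSON object per row.
Dataset<String> ds = spark.createDataset(Arrays.asList(
        "{\"name\": \"foo\", \"address\": {\"state\": \"CA\", \"country\": \"USA\"}, \"docs\": [{\"subject\": \"english\", \"year\": 2016}]}",
        "{\"name\": \"bar\", \"address\": {\"state\": \"OH\", \"country\": \"USA\"}, \"docs\": [{\"subject\": \"math\", \"year\": 2017}]}"),
        Encoders.STRING());

// Since Spark 2.2, read().json accepts a Dataset<String> and infers the
// nested schema, so address becomes a struct and docs an array of structs.
Dataset<Row> parsed = spark.read().json(ds);
parsed.printSchema();
parsed.show(false);
```

Unlike the bean-based workaround, this keeps the parsing distributed and yields proper struct/array columns instead of maps, at the cost of a schema-inference pass over the data.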