I am trying to read a JSON file with Spark 1.4.1 DataFrames and navigate inside it, but the inferred schema does not seem to be correct.
The JSON file is:
{
  "FILE": {
    "TUPLE_CLI": [{
      "ID_CLI": "C3-00000004",
      "TUPLE_ABO": [{
        "ID_ABO": "T0630000000000004",
        "TUPLE_CRA": {
          "CRA": "T070000550330",
          "EFF": "Success"
        },
        "TITRE_ABO": ["Mr", "OOESGUCKDO"],
        "DATNAISS": "1949-02-05"
      },
      {
        "ID_ABO": "T0630000000100004",
        "TUPLE_CRA": [{
          "CRA": "T070000080280",
          "EFF": "Success"
        },
        {
          "CRA": "T070010770366",
          "EFF": "Failed"
        }],
        "TITRE_ABO": ["Mrs", "NP"],
        "DATNAISS": "1970-02-05"
      }]
    },
    {
      "ID_CLI": "C3-00000005",
      "TUPLE_ABO": [{
        "ID_ABO": "T0630000000000005",
        "TUPLE_CRA": [{
          "CRA": "T070000200512",
          "EFF": "Success"
        },
        {
          "CRA": "T070010410078",
          "EFF": "Success"
        }],
        "TITRE_ABO": ["Miss", "OB"],
        "DATNAISS": "1926-11-22"
      }]
    }]
  }
}
The Spark code is:
val j = sqlContext.read.json("/user/arthur/test.json")
j.printSchema
The result is:
root
|-- FILE: struct (nullable = true)
| |-- TUPLE_CLI: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ID_CLI: string (nullable = true)
| | | |-- TUPLE_ABO: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATNAISS: string (nullable = true)
| | | | | |-- ID_ABO: string (nullable = true)
| | | | | |-- TITRE_ABO: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- TUPLE_CRA: string (nullable = true)
TUPLE_CRA is clearly an array, and I cannot understand why it is not inferred as one. In my opinion, the inferred schema should be:
root
|-- FILE: struct (nullable = true)
| |-- TUPLE_CLI: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ID_CLI: string (nullable = true)
| | | |-- TUPLE_ABO: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATNAISS: string (nullable = true)
| | | | | |-- ID_ABO: string (nullable = true)
| | | | | |-- TITRE_ABO: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- TUPLE_CRA: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- CRA: string (nullable = true)
| | | | | | | |-- EFF: string (nullable = true)
Does anyone have an explanation? And when the JSON is more complex, is there a way to tell Spark what the actual schema is?
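Something along these lines is what I have in mind; this is only a sketch I have not verified against 1.4.1, the field names come from the sample file and everything else is guesswork about the API:

import org.apache.spark.sql.types._

// Sketch: declare TUPLE_CRA explicitly as an array of {CRA, EFF} structs
val craType = ArrayType(StructType(Seq(
  StructField("CRA", StringType),
  StructField("EFF", StringType)
)))

val aboType = ArrayType(StructType(Seq(
  StructField("DATNAISS", StringType),
  StructField("ID_ABO", StringType),
  StructField("TITRE_ABO", ArrayType(StringType)),
  StructField("TUPLE_CRA", craType)
)))

val schema = StructType(Seq(
  StructField("FILE", StructType(Seq(
    StructField("TUPLE_CLI", ArrayType(StructType(Seq(
      StructField("ID_CLI", StringType),
      StructField("TUPLE_ABO", aboType)
    ))))
  )))
))

// Passing the schema to the reader should skip inference entirely
val j = sqlContext.read.schema(schema).json("/user/arthur/test.json")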
Answer 0 (score: 2)
OK, I finally understood: the JSON is not what was expected. Notice that the first TUPLE_CRA is a plain object, without square brackets []. The other TUPLE_CRA fields are bracketed arrays containing several elements. Since the same field appears with two different types, Spark cannot infer a single accurate structure and falls back to string. So the problem comes from the way this JSON is generated; I need to change it so that every TUPLE_CRA is an array, even when it contains only one element.
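To illustrate the intended fix (this is what the generator should produce, not what the file currently contains), the first subscriber's TUPLE_CRA should be emitted as a single-element array, brackets included:

"TUPLE_CRA": [{
  "CRA": "T070000550330",
  "EFF": "Success"
}],

Note that even passing an explicit schema via read.schema(...) would probably not have rescued the original file: a record whose TUPLE_CRA is a bare object does not line up with an ArrayType field, so fixing the generation step really is the way to go.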