I have a MongoDB collection with nested documents, structured as follows:
{
"_id" : "35228334dbd1090f6117c5a0011b56b0",
"brasidas" : [
{
"key" : "buy",
"value" : 859193
}
],
"crawl_time" : NumberLong(1526296211997),
"date" : "2018-05-11",
"id" : "44874f4c8c677087bcd5f829b2843e66",
"initNumber" : 0,
"repurchase" : 0,
"source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
"stockCode" : "600020",
"stockName" : "ZYGS",
"type" : "SSE"
}
I want to convert it to a Spark DataFrame and extract the "key" and "value" fields of "brasidas" as separate columns, like this:
initNumber repurchase key value stockName type date
50000 50000 buy 286698 shgf SSE 2015/3/30
The problem is that the "brasidas" field is not uniform. It comes in three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : "buy_free" }, { "key" : "buy_limited" }]
So when I define a StructType in Scala, it does not fit every document. I can only read "brasidas" as a single column and cannot split it by "key". This is what I get:
initNumber repurchase brasidas stockName type date
50000 50000 [{ "key" : "buy", "value" : 286698 }] shgf SSE 2015/3/30
This is the code that reads the MongoDB documents:
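// Read configuration pointing at the "pledge" collection
val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
// getMongoDB.readCollection is my own helper (not shown); it loads the listed fields
val pledge = getMongoDB.readCollection(sc, readpledge, "initNumber", "repurchase", "brasidas", "stockName", "type", "date")
  .selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase", "brasidas", "stockName", "type", "date")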
Answer 0 (score: 0)
If you try df.printSchema(), you will probably find that brasidas was inferred as an ArrayType, most likely an array of maps.
So I would suggest implementing a UDF that takes the array as its argument and transforms it into the shape you need.
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
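To make the stub above a bit more concrete, here is a minimal sketch of what the body might look like. It is written as plain Scala so it can be tried without a Spark session. It assumes the elements arrive as Map[String, String] (check printSchema() to confirm), and the Entry case class and the BrasidasSketch object are hypothetical names introduced only for this illustration.

```scala
import scala.util.Try

object BrasidasSketch {
  // Hypothetical output type: one row per array element.
  // value is None when the element has no "value" field or it is not numeric.
  final case class Entry(key: String, value: Option[Long])

  // Assumption: each element of "brasidas" is a Map[String, String].
  // Elements without a "key" are dropped; a null array yields an empty result.
  def arrayProcess(arr: Seq[Map[String, String]]): Seq[Entry] =
    Option(arr).getOrElse(Seq.empty).flatMap { m =>
      m.get("key").flatMap(Option(_)).map { k =>
        val parsed = m.get("value")
          .flatMap(Option(_))
          .flatMap(v => Try(v.trim.toLong).toOption)
        Entry(k.trim, parsed)
      }
    }
}
```

In Spark, a function like this could be wrapped with org.apache.spark.sql.functions.udf and its result passed to explode, so that each element becomes its own row with key and value columns. If printSchema() reports an array of structs rather than an array of maps, the parameter type would be Seq[Row] instead, with the fields read through getAs.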