How to convert a nested MongoDB collection into a Spark DataFrame

Date: 2018-07-04 11:37:46

Tags: mongodb scala apache-spark apache-spark-sql

I have a nested MongoDB collection whose documents are structured like this:

{
    "_id" : "35228334dbd1090f6117c5a0011b56b0",
    "brasidas" : [ 
        {
            "key" : "buy",
            "value" : 859193
        }
    ],
    "crawl_time" : NumberLong(1526296211997),
    "date" : "2018-05-11",
    "id" : "44874f4c8c677087bcd5f829b2843e66",
    "initNumber" : 0,
    "repurchase" : 0,
    "source_url" : "http://query.sse.com.cn/commonQuery.do?jsonCallBack=jQuery11120015170331124618408_1526262411932&isPagination=true&sqlId=COMMON_SSE_SCSJ_CJGK_ZQZYSHG_JYSLMX_L&beginDate&endDate&securityCode&pageHelp.pageNo=1&pageHelp.beginPage=1&pageHelp.cacheSize=1&pageHelp.endPage=1&pageHelp.pageSize=25",
    "stockCode" : "600020",
    "stockName" : "ZYGS",
    "type" : "SSE"
}
I want to convert it into a Spark DataFrame and extract the "key" and "value" entries of "brasidas" into separate columns, like this:

initNumber  repurchase  key  value   stockName  type  date
50000       50000       buy  286698  shgf       SSE   2015/3/30

The problem is that the "brasidas" field is inconsistent; it appears in three different forms:

  [{ "key" : "buy", "value" : 286698 }] 

  [{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]

  [{ "key" :    ""buy_free" " }, { "key" : "buy_limited" }]

So when I define a StructType in Scala, no single definition fits every document (see the schema sketch after the table below). I can only keep "brasidas" as a single column; I cannot split it up by "key". This is what I get:

initNumber  repurchase  brasidas                                stockName  type  date
50000       50000       [{ "key" : "buy", "value" : 286698 }]  shgf       SSE   2015/3/30
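
For reference, the kind of StructType involved might look like the sketch below (field names are taken from the documents above; brasidasType is my name for it). Making both fields nullable is what allows one schema to cover elements that omit "value":

    import org.apache.spark.sql.types._

    // Sketch of a schema for the "brasidas" array: both fields are nullable,
    // so elements missing "value" (the third form above) still parse as null.
    val brasidasType = ArrayType(
      StructType(Seq(
        StructField("key", StringType, nullable = true),
        StructField("value", LongType, nullable = true)
      ))
    )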

Here is the code that reads the MongoDB documents:

val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge, "initNumber", "repurchase", "brasidas", "stockName", "type", "date")
  .selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase", "brasidas", "stockName", "type", "date")
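
(getMongoDB.readCollection appears to be a project-specific helper. A rough equivalent using the MongoDB Spark connector directly might look like this sketch, where spark is assumed to be an existing SparkSession and mongouri_beehive comes from the surrounding code:

    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.ReadConfig

    // Load the "pledge" collection and keep only the columns of interest.
    val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
    val pledge = MongoSpark.load(spark, readpledge)
      .select("initNumber", "repurchase", "brasidas", "stockName", "type", "date")
)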

1 Answer:

Answer 0 (score: 0)

If you try df.printSchema(), you will probably see that brasidas has ArrayType, most likely an array of maps. So I would suggest implementing some kind of UDF function that takes the array as a parameter and transforms it the way you need:

def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
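
A minimal sketch of such a UDF, assuming Spark has inferred brasidas as array&lt;struct&lt;key:string,value:long&gt;&gt; so that a missing "value" surfaces as null (which matches the three forms shown in the question), and assuming Spark 2.2+ for explode_outer. The names normalize and flat are mine:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, explode_outer, udf}

    // Normalize each array element to a (key, value) pair:
    // elements without a "key" are dropped; a missing "value" defaults to 0.
    val normalize = udf { arr: Seq[Row] =>
      Option(arr).getOrElse(Seq.empty).collect {
        case r if r.getAs[String]("key") != null =>
          (r.getAs[String]("key"),
           Option(r.getAs[java.lang.Long]("value")).map(_.toLong).getOrElse(0L))
      }
    }

    // One output row per key/value pair, matching the desired layout.
    val flat = pledge
      .withColumn("kv", explode_outer(normalize(col("brasidas"))))
      .select(col("initNumber"), col("repurchase"),
        col("kv._1").as("key"), col("kv._2").as("value"),
        col("stockName"), col("type"), col("date"))

If every element were guaranteed to carry both fields, you could skip the UDF entirely, explode brasidas directly, and select kv.key and kv.value.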