How to map two datasets in Spark Java

Asked: 2017-07-11 11:12:36

Tags: java apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

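Hello, I am reading data from MongoDB into a Spark application. My MongoDB database contains 2 collections. The first is profile_data, which holds the actual data with field names (all the input data, including some unique fields):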

{
    "MessageStatus" : 2,
    "Origin" : 1,
    "_id" : ObjectId("596340fe8b0fa35d2880db1a"),
    "accerlation" : 19.4,
    "cylinders" : 4,
    "displacement" : 119,
    "file_id" : ObjectId("59633e48b760e7c8071a6c1c"),
    "horsepower" : 82,
    "modelyear" : 82,
    "modified_date" : ISODate("2017-07-10T08:47:01.641Z"),
    "mpg" : 31,
    "snet_id" : "new_project",
    "unique_id" : "784",
    "username" : "chevy s-10",
    "weight" : 2720
}

The other collection is predictive_model_details, which holds the ML model details such as the model name, the feature fields, and the prediction field (essentially metadata):

{
    "_id" : ObjectId("56b4351be4b064bb19a90324"),
    "algorithm_id" : "55d717a53d9e22022ff2a1e9",
    "algorithm_name" : "K- Nearest Neighbours (IBK)",
    "client_id" : "562e1d51b760d0e408151b91",
    "feature_fields" : [ 
        {
            "name" : "Origin",
            "type" : "int"
        }, 
        {
            "name" : "accerlation",
            "type" : "Double"
        }, 
        {
            "name" : "displacement",
            "type" : "Int"
        }, 
        {
            "name" : "horsepower",
            "type" : "Int"
        }, 
        {
            "name" : "modelyear",
            "type" : "Int"
        }
    ],
    "makeActiveStatus" : "0",
    "model_name" : "test1",
    "parameter_type" : "system_defined",
    "parameters" : [ 
        {
            "symbol" : "-K",
            "value" : "1"
        }
    ],
    "predictor" : {
        "name" : "mpg",
        "type" : "Int"
    },
    "result_exists" : true,
    "snet_id" : "new_project"
}

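So I have created 2 Datasets in Spark, one from each of the two MongoDB collections. Now I want to map these two datasets together so that each profile_data row is matched with its model and carries all of the model's feature fields plus the prediction field. The common field in the 2 datasets is snet_id.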

Can anyone help?
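
To make the intended mapping concrete, here is a minimal sketch of the logic in plain Java collections rather than Spark: take the column names from the model's feature_fields and predictor, keep only the profile_data rows whose snet_id matches the model, and project each row down to those columns. The class name `SnetIdJoinSketch` and the methods `modelColumns` and `joinAndProject` are hypothetical names used for illustration; they are not part of Spark or the MongoDB connector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: illustrates the mapping logic only, not Spark itself.
final class SnetIdJoinSketch {

    /** Builds the list of columns a model needs: its feature fields, then its predictor. */
    static List<String> modelColumns(List<String> featureFields, String predictor) {
        List<String> columns = new ArrayList<>(featureFields);
        columns.add(predictor);
        return columns;
    }

    /** Keeps profile rows whose snet_id matches the model and projects them to the given columns. */
    static List<Map<String, Object>> joinAndProject(
            List<Map<String, Object>> profileRows,
            String modelSnetId,
            List<String> columns) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : profileRows) {
            if (!modelSnetId.equals(row.get("snet_id"))) {
                continue; // this row belongs to a different snet_id
            }
            Map<String, Object> projected = new LinkedHashMap<>();
            projected.put("snet_id", modelSnetId);
            for (String column : columns) {
                projected.put(column, row.get(column)); // null if the row lacks the field
            }
            result.add(projected);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample row copied from the profile_data document in the question.
        Map<String, Object> profile = new LinkedHashMap<>();
        profile.put("snet_id", "new_project");
        profile.put("Origin", 1);
        profile.put("accerlation", 19.4);
        profile.put("displacement", 119);
        profile.put("horsepower", 82);
        profile.put("modelyear", 82);
        profile.put("mpg", 31);
        profile.put("weight", 2720); // not a feature field, so it is dropped

        // Column names copied from the predictive_model_details document.
        List<String> columns = modelColumns(
                Arrays.asList("Origin", "accerlation", "displacement", "horsepower", "modelyear"),
                "mpg");

        System.out.println(joinAndProject(Arrays.asList(profile), "new_project", columns));
    }
}
```

In Spark itself, the same idea would probably translate to collecting the feature and predictor names from the (small) model Dataset on the driver, then calling `join` on the `snet_id` column followed by `select` with those names on the profile Dataset. That translation is an assumption about the setup described above and has not been tested against the MongoDB connector.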

0 answers:

No answers yet