Creating n machine learning models in Spark, parallelized

Time: 2018-11-02 21:44:54

Tags: apache-spark pyspark apache-spark-mllib

So we have a dataset with n unique values in column 1 and m unique values in column 2, and we want to train n simple machine learning models on it.

My plan is to write a function that essentially maps over the DataFrame using a group by, with the learning step as the aggregate function:

for each z in unique(col1):
    X = col2 in data where col1 == z
    Y = col2 in data where col1 != z
    yield SVC.fit(X, Y)

I know this isn't exactly right; it's just an example of the idea.
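
Concretely, here is roughly what I mean in pyspark.ml terms. This is only a sketch: df, col1, and features are made-up names, and LinearSVC stands in for whatever simple estimator we end up using.

# Iterative version of the pseudocode: one one-vs-rest fit per unique
# value of col1. Assumes df is a cached DataFrame with a column "col1"
# and a pre-assembled vector column "features" (made-up names).
from pyspark.sql import functions as F
from pyspark.ml.classification import LinearSVC

def fit_one_vs_rest(df, z):
    # Label each row by whether it belongs to group z, then fit.
    labeled = df.withColumn("label", (F.col("col1") == z).cast("double"))
    return LinearSVC(featuresCol="features", labelCol="label").fit(labeled)

values = [z for (z,) in df.select("col1").distinct().collect()]
models = {z: fit_one_vs_rest(df, z) for z in values}  # runs one at a time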

My main issue is that I don't want these fits to run one after another. I want to submit all of the jobs at once and trust that they get parallelized properly.
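
As far as I can tell from the job scheduling docs, jobs submitted from separate driver threads can run concurrently, so one option is to fan the loop above out with a thread pool. The sketch below reuses fit_one_vs_rest and values from the snippet above, and assumes the session was started with spark.scheduler.mode=FAIR so the concurrent jobs share the cluster instead of queuing FIFO:

# Submit all n fits at once from driver threads. Spark can schedule
# jobs from different threads concurrently; with the FAIR scheduler
# they share executors rather than running strictly one after another.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=min(8, len(values))) as pool:
    futures = {z: pool.submit(fit_one_vs_rest, df, z) for z in values}
    models = {z: f.result() for z, f in futures.items()}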

Would embedding Spark ML jobs inside an aggregation function produce proper parallelism, or would it confuse the scheduler?
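
My worry is that, as far as I understand, executors cannot launch Spark jobs themselves (the SparkContext lives only on the driver), so a distributed pyspark.ml fit presumably cannot run inside the aggregate function. The closest thing I have sketched is to make each per-group fit single-node: replicate the data once per unique value of col1 and train a plain scikit-learn SVC inside a grouped-map pandas UDF (Spark 2.3+). The column names and the scikit-learn dependency are assumptions:

# Sketch: one single-node scikit-learn SVC per group inside a
# grouped-map pandas UDF, instead of nesting Spark ML jobs in an
# aggregate. The data is replicated once per candidate value z of
# col1 so each group sees the full one-vs-rest labeling.
import pandas as pd
from sklearn.svm import SVC
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

zs = df.select(F.col("col1").alias("z")).distinct()
replicated = zs.crossJoin(df)  # one full copy of df per candidate z

@pandas_udf("z string, train_acc double", PandasUDFType.GROUPED_MAP)
def fit_group(pdf):
    # Runs as an ordinary task: plain pandas + sklearn, no nested jobs.
    X = pdf[["col2"]]
    y = (pdf["col1"] == pdf["z"].iloc[0]).astype(int)
    acc = SVC().fit(X, y).score(X, y)
    return pd.DataFrame({"z": [pdf["z"].iloc[0]], "train_acc": [acc]})

results = replicated.groupby("z").apply(fit_group)

(Returning the fitted models themselves would mean pickling them into a binary column, so the sketch just returns a training score per group.)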

0 answers:

No answers