So we have a dataset with n unique values in column 1 and m unique values in column 2. We want to train n simple machine learning models, as follows:
const input = [
    {
        a : "test", b : "5", c : 4
    },
    {
        a : "test 2", b : "2", c : 10
    },
    {
        a : "test", b : "5", c : 66
    }
];
/**
 * Sums every field except those listed in `fields`, which are used as the
 * grouping keys.
 *
 * @param input The input array
 * @param fields The fields to group by
 * @returns {*}
 */
function sumGroup( input, fields ) {
    return input.reduce( ( result, item ) => {
        /** Get all keys of the current item ... ["a","b","c"...] */
        const itemKeys = Object.keys( item );
        /** Get the grouped item that was already stored, if any */
        let currentItem = result.find( element => {
            return fields.map( field => element[field] === item[field] )
                .indexOf( false ) === -1;
        } );
        /** If there is no grouped item yet, create one and add it */
        if ( !currentItem ) {
            currentItem = itemKeys.filter( key => fields.indexOf( key ) > -1 )
                .reduce( ( obj, key ) => Object.assign( obj, {
                    [key] : item[key]
                } ), {} );
            result.push( currentItem );
        }
        /**
         * Finally, sum all other keys into the grouped item, which is
         * already referenced from the result array
         */
        itemKeys.filter( key => fields.indexOf( key ) === -1 )
            .forEach( key => {
                if ( !currentItem[key] ) { currentItem[key] = 0; }
                currentItem[key] = currentItem[key] + item[key];
            } );
        return result;
    }, [] );
}
console.log( sumGroup( input, ["a", "b"] ) );
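I intend to write a function that essentially maps over the DataFrame using a group by to achieve our goal, with the training step acting as the aggregation function.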
for each z in unique(col1):
    X = col2 in data where col1 == z
    Y = col2 in data where col1 != z
    yield SVC.fit(X, Y)
I know this isn't quite right; it is only meant to illustrate the idea.
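One possible reading of the pseudocode is a one-vs-rest setup: for each unique value z in col1, the col2 values are the features and the label records whether the row's col1 equals z. Below is a minimal plain-Python sketch of that reading. The `(col1, col2)` row layout and the helper name `one_vs_rest_sets` are assumptions made for illustration and do not come from the question.

```python
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, float]  # (col1, col2): assumed row layout


def one_vs_rest_sets(rows: List[Row]) -> Dict[Hashable, Tuple[List[float], List[int]]]:
    """Hypothetical helper: for each unique col1 value z, build
    (features, labels), where features are all col2 values and
    labels are 1 if the row's col1 == z, else 0."""
    unique_values = {col1 for col1, _ in rows}
    features = [col2 for _, col2 in rows]
    return {
        z: (features, [1 if col1 == z else 0 for col1, _ in rows])
        for z in unique_values
    }


rows = [("a", 1.0), ("b", 5.0), ("a", 1.5), ("c", 9.0)]
sets = one_vs_rest_sets(rows)
```

Each entry of `sets` could then be handed to whatever fit call is actually used, one model per key.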
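My main concern is that I don't want these operations to run iteratively, one after another. I want to submit all of the jobs at once and trust that they will be parallelized correctly.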
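The "submit everything at once" pattern can be sketched with Python's standard `concurrent.futures`. This only illustrates the submission pattern, under stated assumptions: `train_one` is a hypothetical stand-in for the real fit call (a simple midpoint threshold, not an SVC), and the `(col1, col2)` row layout is assumed. In CPython, threads do not speed up pure-Python CPU-bound work, so any real parallelism would have to come from the backend that performs the training.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (col1, col2): assumed row layout


def train_one(z: str, rows: List[Row]) -> Tuple[str, float]:
    """Hypothetical stand-in for model fitting: returns the midpoint between
    the mean col2 where col1 == z and the mean col2 of all other rows.
    Requires at least two unique col1 values."""
    inside = [x for c, x in rows if c == z]
    outside = [x for c, x in rows if c != z]
    return z, (mean(inside) + mean(outside)) / 2


def train_all(rows: List[Row], max_workers: int = 4) -> Dict[str, float]:
    """Submit one training task per unique col1 value, then collect results."""
    unique_values = sorted({c for c, _ in rows})
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All tasks are queued up front; submission does not wait on results.
        futures = [pool.submit(train_one, z, rows) for z in unique_values]
        return dict(f.result() for f in futures)


rows = [("a", 1.0), ("b", 5.0), ("a", 3.0), ("b", 7.0)]
models = train_all(rows)
```

The design choice here is to launch the tasks from the driving program rather than from inside an aggregation function, so that a single scheduler sees all of the work at once.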
Will embedding Spark ML jobs inside an aggregation function yield proper parallelization, or will it confuse the scheduler?