Computing the product of an RDD[Array[(String, Int)]] and an RDD[(String, Double)]

Time: 2017-10-29 15:57:45

Tags: scala apache-spark

I have an RDD of the following type:

RDD[Array[(String, Int)]]

Array(Array((yellow,1), (green,1), (orange,1), (red,1)), Array((banana,1), (orange,1), (green,1), (apple,2), (kiwi,1), (pear,1), (red,1)), Array((salad,1), (potato,1), (carrot,1), (green,1), (leek,1)))

and a second RDD of type:

RDD[(String, Double)]

Array((pear,1.0986122886681098), (orange,0.0), (kiwi,1.0986122886681098), (apple,0.0), (yellow,1.0986122886681098), (banana,1.0986122886681098), (green,0.0), (carrot,1.0986122886681098), (leek,1.0986122886681098), (salad,1.0986122886681098), (red,0.0), (potato,1.0986122886681098))

I want to map over the elements of the first RDD, multiplying each word's value by the value of the same word in the second RDD.

The result should have this type:

RDD[Array[(String, Double)]]

1 answer:

Answer 0: (score: 0)

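Since the actual parallel computation happens on the elements of the inner arrays rather than on the arrays themselves, I would parallelize those elements (the arrays of words and values) and keep the resulting RDDs in a driver-side collection. Instead of an RDD of arrays, you will have an array of RDDs. This will come in handy later.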

val setOneRaw = 
    Array( 
        Array( 
            ( "kiwi", 1 ), 
            ( "green", 1 ), 
            ( "orange", 1 ), 
            ( "red", 1 ) 
        ), 
        Array( 
            ( "banana", 1 ), 
            ( "orange", 1 ), 
            ( "green", 1 )
        ), 
        Array( 
            ( "kiwi", 1 ), 
            ( "pear", 1 ), 
            ( "carrot", 1 )
        )
    )

val setOneRDDs = 
    setOneRaw
    .map( sc.parallelize( _ ) )

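If you do this, the second RDD has the same shape as the RDDs in the main collection: a pair RDD keyed by word (its values are Double rather than Int).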

val setTwo =
    sc.parallelize(
        Array(
            ( "pear", 1.0986122886681098 ), 
            ( "orange", 0.0 ), 
            ( "kiwi", 1.0986122886681098 ), 
            ( "apple", 0.0 ) 
        )
    )

With both sides handled as RDDs, you can join them and then multiply the values in the tuples that the join produces.

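val mixedSet = 
    setOneRDDs
    // leftOuterJoin yields ( word, ( count, Option[weight] ) )
    .map( _.leftOuterJoin( setTwo ) )
    .map( 
        _.map { 
            // a word with no weight in setTwo keeps its value ( multiplied by 1.0 )
            case ( word, ( count, weight ) ) => ( word, count * weight.getOrElse( 1.0 ) ) 
        } 
    )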

Using leftOuterJoin handles the case where the first set has a value for a given word but the second set does not.
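To make the Option handling concrete, here is a minimal sketch that uses plain Scala collections in place of RDDs, so it runs without Spark. The names LeftJoinSketch, leftOuterJoinLocal and multiply are hypothetical helpers for illustration, not Spark APIs, and the Map on the right side assumes each word appears at most once in the second set.

```scala
object LeftJoinSketch {
  // Hypothetical local stand-in for RDD.leftOuterJoin: every left pair is kept,
  // and the matching right-hand value (if any) is wrapped in an Option.
  def leftOuterJoinLocal(
      left: Seq[(String, Int)],
      right: Map[String, Double]
  ): Seq[(String, (Int, Option[Double]))] =
    left.map { case (word, count) => (word, (count, right.get(word))) }

  // Multiply each count by its weight; a missing weight falls back to 1.0,
  // which leaves the original value unchanged.
  def multiply(joined: Seq[(String, (Int, Option[Double]))]): Seq[(String, Double)] =
    joined.map { case (word, (count, weight)) => (word, count * weight.getOrElse(1.0)) }

  def main(args: Array[String]): Unit = {
    val left = Seq("kiwi" -> 1, "green" -> 1, "orange" -> 1, "red" -> 1)
    val right = Map(
      "pear" -> 1.0986122886681098,
      "orange" -> 0.0,
      "kiwi" -> 1.0986122886681098,
      "apple" -> 0.0
    )
    multiply(leftOuterJoinLocal(left, right)).foreach(println)
  }
}
```

Unlike this local version, Spark does not guarantee the order of the joined rows.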

Result for the data given in the example:

(orange,1)
(green,1)
(red,1)
(kiwi,1)
(banana,1)
(orange,1)
(green,1)
(carrot,1)
(pear,1)
(kiwi,1)
-=-
(green,1.0)
(orange,0.0)
(red,1.0)
(kiwi,1.0986122886681098)
(green,1.0)
(banana,1.0)
(orange,0.0)
(kiwi,1.0986122886681098)
(carrot,1.0)
(pear,1.0986122886681098)

Words like "red" or "banana", which have no entry in the second set, keep their values as they are.
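If the result has to stay an RDD[Array[(String, Double)]], as the question asks, one alternative is to collect the second RDD into a map and broadcast it. This is a different technique from the join used in this answer, and it is only a sketch under stated assumptions: the second RDD must be small enough to fit in memory on the driver and on each executor. The names BroadcastWeights and weigh are hypothetical, and the local master in main is for illustration only.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BroadcastWeights {
  // Hypothetical helper: multiplies every count by the weight of the same word.
  // Words without a weight keep their value (multiplied by 1.0).
  def weigh(
      counts: RDD[Array[(String, Int)]],
      weights: RDD[(String, Double)]
  ): RDD[Array[(String, Double)]] = {
    // Assumption: the weights fit in memory, so they can be collected and broadcast.
    val lookup = counts.sparkContext.broadcast(weights.collectAsMap())
    counts.map(_.map { case (word, count) =>
      (word, count * lookup.value.getOrElse(word, 1.0))
    })
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("weights").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq(
      Array("kiwi" -> 1, "green" -> 1, "orange" -> 1, "red" -> 1),
      Array("banana" -> 1, "orange" -> 1, "green" -> 1)
    ))
    val weights = sc.parallelize(Seq("orange" -> 0.0, "kiwi" -> 1.0986122886681098))

    // Each inner array stays intact; only the values change type and magnitude.
    weigh(counts, weights).collect().foreach(arr => println(arr.mkString(", ")))
    spark.stop()
  }
}
```

Here each inner array is kept as a unit, so the element order inside an array is preserved, at the cost of shipping the whole weight map to every executor.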