Cross-combining two RDDs with PySpark

Date: 2015-06-25 22:54:16

Tags: lambda apache-spark rdd pyspark

How can I cross-combine (is that the right way to describe it?) two RDDs?

Input:


rdd1 = [a, b]
rdd2 = [c, d]

Output:

rdd3 = [(a, c), (a, d), (b, c), (b, d)]

I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), and it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess this means you cannot nest transformations the way you can nest list comprehensions, and that a single statement can only perform one action.
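For reference, here is the plain-Python analogue of the cross product being asked for (ordinary lists, not Spark code); this is exactly the nesting that does not carry over to RDDs:

xs = ["a", "b"]
ys = ["c", "d"]

# A nested list comprehension works on local lists, but the equivalent
# nesting of RDD transformations is what triggers the error above.
pairs = [(x, y) for x in xs for y in ys]
print(pairs)  # [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]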

2 answers:

Answer 0 (score: 3)

As you noticed, you cannot perform a transformation within another transformation (note that flatMap and map are transformations rather than actions, because they return RDDs). Thankfully, what you are trying to accomplish is directly supported by another transformation in the Spark API, namely cartesian (see http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD).

So you want to do rdd1.cartesian(rdd2).
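A minimal runnable sketch of that answer, using the data from the question (it assumes a local SparkContext; the master and app name here are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "cartesian_example")

rdd1 = sc.parallelize(["a", "b"])
rdd2 = sc.parallelize(["c", "d"])

# cartesian is itself a single transformation, so no nesting is needed
rdd3 = rdd1.cartesian(rdd2)

print(sorted(rdd3.collect()))
# [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]

sc.stop()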

Answer 1 (score: 1)

You can use the cartesian transformation. Here's an example from the documentation:

>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

In your case, you would do rdd1.cartesian(rdd2).
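One general caveat worth adding (not from the original answers): cartesian emits one pair per element combination, so the result has rdd1.count() * rdd2.count() elements and can get expensive on large inputs. A quick sanity check on the question's data:

rdd3 = rdd1.cartesian(rdd2)
rdd3.count()            # 4 == rdd1.count() * rdd2.count()
sorted(rdd3.collect())  # [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]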