Question

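I am learning Spark and am looking for the best approach to the following problem.

I have two datasets, users and transactions, as shown below, and I want to join them to find the unique locations for each product.

The file headers are as follows: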

id,email,language,location ----------- USER HEADERS
txid,productid,userid,price,desc -------------------- TRANSACTION HEADERS

Here is my approach:

/*
         * Load user data set into userDataFrame
         * Load transaction data set into transactionDataFrame
         * join both on user id - userTransactionFrame
         * select productid and location columns from the joined dataset into a new dataframe - productIdLocationDataFrame
         * convert the new dataframe into a javardd - productIdLocationJavaRDD
         * make the javardd a pair rdd - productIdLocationJavaPairRDD
         * group the pair rdd by key - productLocationList
         * apply mapValues on the grouped pair rdd to convert each list of values to a set of values for duplicate filtering - productUniqLocations
         * 
         * */

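I am not sure I have done this the right way, and I still feel it "could be done better or differently".

I am doubtful about the part where the duplicate filtering is done on the JavaPairRDD.

Please evaluate the approach and the code, and let me know of any better solutions.

Code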

    SparkConf conf = new SparkConf();
    conf.setAppName("Sample App - Uniq Location per item");
    JavaSparkContext jsc = new JavaSparkContext("local[*]","A 1");
    //JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(jsc);

    //id    email   language    location ----------- USER HEADERS
    DataFrame userDataFrame = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .option("delimiter", "\t")
            .load("user");

    //txid  pid uid price   desc -------------------- TRANSACTION HEADERS
    DataFrame transactionDataFrame = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .option("delimiter", "\t")
            .load("transactions");

    Column joinColumn = userDataFrame.col("id").equalTo(transactionDataFrame.col("uid"));

    DataFrame userTransactionFrame = userDataFrame.join(transactionDataFrame,joinColumn,"rightouter");

    DataFrame productIdLocationDataFrame = userTransactionFrame.select(userTransactionFrame.col("pid"),userTransactionFrame.col("location"));

    JavaRDD<Row> productIdLocationJavaRDD = productIdLocationDataFrame.toJavaRDD();

    JavaPairRDD<String, String> productIdLocationJavaPairRDD = productIdLocationJavaRDD.mapToPair(new PairFunction<Row, String, String>() {

        public Tuple2<String, String> call(Row inRow) throws Exception {
            return new Tuple2(inRow.get(0),inRow.get(1));
        }
    });


    JavaPairRDD<String, Iterable<String>> productLocationList = productIdLocationJavaPairRDD.groupByKey();

    JavaPairRDD<String, Iterable<String>> productUniqLocations = productLocationList.mapValues(new Function<Iterable<String>, Iterable<String>>() {

        public Iterable<String> call(Iterable<String> inputValues) throws Exception {
            return new HashSet<String>((Collection<? extends String>) inputValues);
        }
    });

    productUniqLocations.saveAsTextFile("uniq");

The good part is that the code runs and produces the output I expect.

Answer 1

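The lowest-hanging fruit is getting rid of groupByKey.

Using aggregateByKey should do the job, since the output type of the values differs from the input type (we want one set per key).

Code in Scala: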

    pairRDD.aggregateByKey(new java.util.HashSet[String])(
      (locationSet, location) => { locationSet.add(location); locationSet },
      (locSet1, locSet2) => { locSet1.addAll(locSet2); locSet1 }
    )

Java equivalent:

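    import java.util.HashSet;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;

    // Sequence function: add one location to the set accumulated within a partition.
    Function2<HashSet<String>, String, HashSet<String>> sequenceFunction =
            new Function2<HashSet<String>, String, HashSet<String>>() {
        public HashSet<String> call(HashSet<String> aSet, String arg1) throws Exception {
            aSet.add(arg1);
            return aSet;
        }
    };

    // Combine function: merge the partial sets built on different partitions.
    Function2<HashSet<String>, HashSet<String>, HashSet<String>> combineFunc =
            new Function2<HashSet<String>, HashSet<String>, HashSet<String>>() {
        public HashSet<String> call(HashSet<String> arg0, HashSet<String> arg1) throws Exception {
            arg0.addAll(arg1);
            return arg0;
        }
    };

    // Start each key from an empty set; locations are de-duplicated as they are added.
    JavaPairRDD<String, HashSet<String>> byKey =
            productIdLocationJavaPairRDD.aggregateByKey(new HashSet<String>(), sequenceFunction, combineFunc);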
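
To see what the sequence and combine functions each contribute, it can help to mimic them outside Spark. The following is a plain Java sketch, not Spark code: the class name `AggregateByKeySketch`, the helper `pair`, and the sample data are made up for illustration, and each list simply stands in for one partition. The sequence step de-duplicates locations within a partition, so the combine step only has to merge sets that are already unique.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AggregateByKeySketch {

    // Hypothetical helper: build one (productId, location) pair.
    static Map.Entry<String, String> pair(String productId, String location) {
        return new SimpleEntry<>(productId, location);
    }

    // Sequence step: fold the pairs of one "partition" into a set of locations per key.
    static Map<String, Set<String>> aggregatePartition(List<Map.Entry<String, String>> partition) {
        Map<String, Set<String>> perKey = new HashMap<>();
        for (Map.Entry<String, String> p : partition) {
            perKey.computeIfAbsent(p.getKey(), k -> new HashSet<>()).add(p.getValue());
        }
        return perKey;
    }

    // Combine step: merge the partial sets of two "partitions" key by key.
    static Map<String, Set<String>> combine(Map<String, Set<String>> left,
                                            Map<String, Set<String>> right) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : left.entrySet()) {
            merged.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        for (Map.Entry<String, Set<String>> e : right.entrySet()) {
            merged.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data: two partitions of (productId, location) pairs.
        List<Map.Entry<String, String>> partition1 =
                Arrays.asList(pair("p1", "NY"), pair("p1", "NY"), pair("p2", "LA"));
        List<Map.Entry<String, String>> partition2 =
                Arrays.asList(pair("p1", "SF"), pair("p2", "LA"));

        System.out.println(combine(aggregatePartition(partition1), aggregatePartition(partition2)));
    }
}
```

With groupByKey, by contrast, every duplicate location would be collected per key before the conversion to a set removes it.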

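Secondly, joins work best when the datasets are co-partitioned.

Since you are working with DataFrames, partitioning out of the box may not be available if you are using Spark < 1.6. You may therefore want to read the data into RDDs, partition them, and then create the DataFrames. For your use case, it might be better not to involve DataFrames at all.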
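
As a rough illustration of what co-partitioning means, here is a plain Java sketch, not Spark code. It assumes a hash-based scheme (non-negative modulo of the key's hash code); the class name `CoPartitionSketch`, the partition count, and the sample ids are made up for illustration. When both datasets are partitioned by the same function and the same number of partitions, a given user id lands in the same partition index on both sides, so matching rows can be joined partition by partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CoPartitionSketch {

    // Non-negative modulo of the hash code, so negative hash codes still map to a valid index.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Distribute keys into numPartitions buckets using partitionFor.
    static List<List<Integer>> partition(List<Integer> keys, int numPartitions) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            buckets.get(partitionFor(key, numPartitions)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // illustrative value

        // Made-up sample data: user ids from the users file and from the transactions file.
        List<Integer> userIds = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> transactionUserIds = Arrays.asList(5, 3, 3, 1, 2);

        // The same id always ends up at the same bucket index in both lists.
        System.out.println(partition(userIds, numPartitions));
        System.out.println(partition(transactionUserIds, numPartitions));
    }
}
```

In Spark itself, the usual way to get this effect on pair RDDs would be to call `partitionBy` with the same partitioner (for example a `HashPartitioner` with the same partition count) on both sides before the join; whether that pays off depends on the data sizes.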

Best approach for duplicate filtering in a Java Spark JavaPairRDD
