Group by multiple columns in Spark Core

Date: 2019-04-23 20:30:16

Tags: java apache-spark

I want to perform a group-by operation on an RDD over multiple fields using Spark Core.

So far I have been able to join two different RDDs and group the resulting RDD by a single column (date), but I want to group by multiple keys/fields (country/date).

I have defined an RDD of type JavaRDD<ProductSale>.
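For context, my current single-column grouping looks roughly like the sketch below (salesRDD is a placeholder name for the joined JavaRDD<ProductSale>, not code from this post):

// Minimal sketch of the existing single-column grouping (by date only)
JavaPairRDD<String, Iterable<ProductSale>> salesByDate =
        salesRDD.groupBy(ProductSale::getDate);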

public class ProductSale implements Serializable {

    private static final long serialVersionUID = -4579808280658565853L;

    private String country;
    private String date;
    private Double price;

    public String getCountry() {
        return country;
    }

    public void setCountry(String country) {
        this.country = country;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }
}

Actual data

country |    date   | price

Japan   |2019-04-17 | 5000.0
USA     |2019-04-16 | 10000.0
Japan   |2019-04-17 | 3000.0
UK      |2019-04-15 | 4000.0

Expected output

country |   date    | price

Japan   |2019-04-17 | 8000.0
USA     |2019-04-16 | 10000.0
UK      |2019-04-15 | 4000.0

1 Answer:

Answer 0 (score: 0)

In my solution I performed a GroupBy operation on multiple columns. In this case, the multiple columns are Country and Date.

In the groupBy method, we can simply pass a lambda function that defines the value of the group-by element. We could also implement a functional interface and pass it to the groupBy call, but I believe passing a lambda function is more readable; an explicit variant is sketched below for comparison.
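For illustration, a minimal sketch of the explicit functional-interface variant, using Spark's org.apache.spark.api.java.function.Function (the variable names here are placeholders, not part of the original code):

import org.apache.spark.api.java.function.Function;

// Explicit implementation of Spark's Function interface instead of a lambda
final Function<ProductSaleEntity, String> keyExtractor = new Function<ProductSaleEntity, String>() {
    @Override
    public String call(final ProductSaleEntity entity) {
        // Concatenate the two grouping columns into one composite key
        return entity.getCountry() + "::" + entity.getDate();
    }
};
final JavaPairRDD<String, Iterable<ProductSaleEntity>> grouped = entityJavaRDD.groupBy(keyExtractor);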

As mentioned in the question, I defined an entity class that is broadly similar to the code block above. The minor difference is that I used Lombok annotations for the getters and the constructor. So my class looks like this; it is named ProductSaleEntity.java:

import lombok.AllArgsConstructor;
import lombok.Getter;

import java.io.Serializable;

@Getter
@AllArgsConstructor
public class ProductSaleEntity implements Serializable {
    private final String country;
    private final String date;
    private final Double price;

    @Override
    public String toString() {
        return "ProductSaleEntity{" +
                "country='" + country + '\'' +
                ", date='" + date + '\'' +
                ", price=" + price +
                '}';
    }
}

Now, the class where I define the GroupBy logic is as follows. It is named MultipleColumnsGroupBy.java:

import lombok.extern.slf4j.Slf4j;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Slf4j
public class MultipleColumnsGroupBy {
    public static void main(String[] args) {
        final String applicationName = MultipleColumnsGroupBy.class.getName();
        final SparkConf sparkConf = new SparkConf().setAppName(applicationName).setMaster("local");
        final JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        javaSparkContext.setLogLevel("INFO");

        final ProductSaleEntity productSaleEntity1 = new ProductSaleEntity("Japan", "2019-04-17", 5000.0);
        final ProductSaleEntity productSaleEntity2 = new ProductSaleEntity("USA", "2019-04-16", 10000.0);
        final ProductSaleEntity productSaleEntity3 = new ProductSaleEntity("Japan", "2019-04-17", 3000.0);
        final ProductSaleEntity productSaleEntity4 = new ProductSaleEntity("UK", "2019-04-15", 4000.0);

        final List<ProductSaleEntity> productSaleEntityList = new ArrayList<>();
        productSaleEntityList.add(productSaleEntity1);
        productSaleEntityList.add(productSaleEntity2);
        productSaleEntityList.add(productSaleEntity3);
        productSaleEntityList.add(productSaleEntity4);

        final JavaRDD<ProductSaleEntity> entityJavaRDD = javaSparkContext.parallelize(productSaleEntityList, 1);

        // using groupBy approach
        final JavaPairRDD<String, Iterable<ProductSaleEntity>> countryDateToJavaPairEntityRDD =
                entityJavaRDD.groupBy(a -> a.getCountry() + "::" + a.getDate());

        final List<Tuple2<String, Iterable<ProductSaleEntity>>> countryDateToListOfEntity =
                countryDateToJavaPairEntityRDD.collect();
        log.info("countryDateToListOfEntity is = {} ", countryDateToListOfEntity);

        for (final Tuple2<String, Iterable<ProductSaleEntity>> item : countryDateToListOfEntity) {
            log.info("country and Date is = {} ", item._1);
            final Iterator<ProductSaleEntity> iterator = item._2.iterator();
            Double totalPrice = 0.0;

            while (iterator.hasNext()) {
                final Double currentPrice = iterator.next().getPrice();
                log.info("Current price is = {} ", currentPrice);
                totalPrice += currentPrice;
            }
            log.info("Total price is = {} ", totalPrice);
        }
        // map of the count of each unique combination of Country and Date
        final Map<String, Long> collectedMapCount = countryDateToListOfEntity.stream().
                collect(Collectors.toMap(p -> p._1, p -> p._2.spliterator().getExactSizeIfKnown()));
        log.info("collectedMapCount is = {} ", collectedMapCount.toString());
    }
}
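As a side note beyond the original answer, the "::" string concatenation could be swapped for a Tuple2 composite key, since Tuple2 supplies equals and hashCode; a minimal sketch of that variant:

// Variant sketch: group by a Tuple2 composite key instead of "country::date"
final JavaPairRDD<Tuple2<String, String>, Iterable<ProductSaleEntity>> groupedByTuple =
        entityJavaRDD.groupBy(e -> new Tuple2<>(e.getCountry(), e.getDate()));

This avoids any ambiguity if a country name or date ever contained the delimiter.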

A few notes/points about the above code:

  1. I have set the mode for running this Spark code to local.
  2. I initialized the object productSaleEntity1 (and the rest) using a plain constructor and added them to the list productSaleEntityList.
  3. I converted this Java list into entityJavaRDD using the parallelize method, passing the numSlices parameter as 1.
  4. After performing the group-by, I used the collect action and obtained a Java list.
  5. A simple iteration over the list fetches the price elements and sums them to find the total price for each unique combination of country and date (a distributed alternative using reduceByKey is sketched after this list).
  6. In the final stage, I simply computed a map to find the count of each unique combination of country and date.
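Since only the per-key total is needed in step 5, the summation could also run on the executors with mapToPair and reduceByKey instead of collecting and iterating on the driver. This is an alternative sketch, not part of the original answer:

// Sketch: aggregate the total price per (country, date) on the cluster
final JavaPairRDD<Tuple2<String, String>, Double> totalsByCountryDate = entityJavaRDD
        .mapToPair(e -> new Tuple2<>(new Tuple2<>(e.getCountry(), e.getDate()), e.getPrice()))
        .reduceByKey(Double::sum);
totalsByCountryDate.collect().forEach(t -> log.info("Total for {} is = {}", t._1, t._2));

Similarly, the count in step 6 could be read directly from countryDateToJavaPairEntityRDD.countByKey().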

Sample output for the above input data:

21/05/27 17:44:38 INFO MultipleColumnsGroupBy: countryDateToListOfEntity is = [(Japan::2019-04-17,[ProductSaleEntity{country='Japan', date='2019-04-17', price=5000.0}, ProductSaleEntity{country='Japan', date='2019-04-17', price=3000.0}]), (USA::2019-04-16,[ProductSaleEntity{country='USA', date='2019-04-16', price=10000.0}]), (UK::2019-04-15,[ProductSaleEntity{country='UK', date='2019-04-15', price=4000.0}])] 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = Japan::2019-04-17 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 5000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 3000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 8000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = USA::2019-04-16 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 10000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 10000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = UK::2019-04-15 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 4000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 4000.0 
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: collectedMapCount is = {Japan::2019-04-17=2, UK::2019-04-15=1, USA::2019-04-16=1}