I want to group an RDD by multiple fields using Spark Core.
So far, I have been able to join two different RDDs and group the resulting RDD by a single column (date), but now I want to group by multiple keys/fields (country/date).
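For context, the single-column grouping I have working so far looks roughly like the sketch below (the wrapper class SingleColumnGrouping and its method are placeholders, not my real code; ProductSale is the entity class shown next):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

class SingleColumnGrouping {
    // Groups the joined sales RDD by the single "date" column.
    static JavaPairRDD<String, Iterable<ProductSale>> groupByDate(JavaRDD<ProductSale> sales) {
        return sales.groupBy(ProductSale::getDate);
    }
}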
I have defined an RDD of type JavaRDD<ProductSale>:
public class ProductSale implements Serializable {

    private static final long serialVersionUID = -4579808280658565853L;

    private String country;
    private String date;
    private Double price;

    public String getCountry() {
        return country;
    }

    public void setCountry(String country) {
        this.country = country;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }
}
Actual data:

country | date       | price
Japan   | 2019-04-17 | 5000.0
USA     | 2019-04-16 | 10000.0
Japan   | 2019-04-17 | 3000.0
UK      | 2019-04-15 | 4000.0

Expected output:

country | date       | price
Japan   | 2019-04-17 | 8000.0
USA     | 2019-04-16 | 10000.0
UK      | 2019-04-15 | 4000.0
Answer:
In my solution I perform the GroupBy operation on multiple columns; in this case, the columns are Country and Date.

In the groupBy method we can simply pass a lambda function that defines the grouping key for each element. We could also implement a functional interface and pass it to the groupBy call, but I believe a lambda is more readable. Both variants are sketched below.
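A minimal sketch of both variants (the wrapper class GroupKeyVariants is just for illustration; ProductSaleEntity is the Lombok entity defined below):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

class GroupKeyVariants {

    // Variant 1: a lambda that builds a composite "country::date" key.
    static JavaPairRDD<String, Iterable<ProductSaleEntity>> withLambda(JavaRDD<ProductSaleEntity> rdd) {
        return rdd.groupBy(e -> e.getCountry() + "::" + e.getDate());
    }

    // Variant 2: the same key function as an explicit implementation of
    // Spark's functional interface org.apache.spark.api.java.function.Function.
    static JavaPairRDD<String, Iterable<ProductSaleEntity>> withFunction(JavaRDD<ProductSaleEntity> rdd) {
        return rdd.groupBy(new Function<ProductSaleEntity, String>() {
            @Override
            public String call(ProductSaleEntity e) {
                return e.getCountry() + "::" + e.getDate();
            }
        });
    }
}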
As mentioned in the question, I defined an entity class that is broadly similar to the code block above. The minor difference is that I use Lombok annotations for the getters and the constructor, so my class looks like this. It is named ProductSaleEntity.java:
import lombok.AllArgsConstructor;
import lombok.Getter;

import java.io.Serializable;

@Getter
@AllArgsConstructor
public class ProductSaleEntity implements Serializable {

    private final String country;
    private final String date;
    private final Double price;

    @Override
    public String toString() {
        return "ProductSaleEntity{" +
                "country='" + country + '\'' +
                ", date='" + date + '\'' +
                ", price=" + price +
                '}';
    }
}
Now, the class in which I define the GroupBy logic is as follows. It is named MultipleColumnsGroupBy.java:
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Slf4j
public class MultipleColumnsGroupBy {

    public static void main(String[] args) {
        final String applicationName = MultipleColumnsGroupBy.class.getName();
        final SparkConf sparkConf = new SparkConf().setAppName(applicationName).setMaster("local");
        final JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        javaSparkContext.setLogLevel("INFO");

        // Build the sample input that matches the data in the question.
        final ProductSaleEntity productSaleEntity1 = new ProductSaleEntity("Japan", "2019-04-17", 5000.0);
        final ProductSaleEntity productSaleEntity2 = new ProductSaleEntity("USA", "2019-04-16", 10000.0);
        final ProductSaleEntity productSaleEntity3 = new ProductSaleEntity("Japan", "2019-04-17", 3000.0);
        final ProductSaleEntity productSaleEntity4 = new ProductSaleEntity("UK", "2019-04-15", 4000.0);

        final List<ProductSaleEntity> productSaleEntityList = new ArrayList<>();
        productSaleEntityList.add(productSaleEntity1);
        productSaleEntityList.add(productSaleEntity2);
        productSaleEntityList.add(productSaleEntity3);
        productSaleEntityList.add(productSaleEntity4);

        final JavaRDD<ProductSaleEntity> entityJavaRDD = javaSparkContext.parallelize(productSaleEntityList, 1);

        // Using the groupBy approach: the key is the composite "country::date" string.
        final JavaPairRDD<String, Iterable<ProductSaleEntity>> countryDateToJavaPairEntityRDD =
                entityJavaRDD.groupBy(a -> a.getCountry() + "::" + a.getDate());

        final List<Tuple2<String, Iterable<ProductSaleEntity>>> countryDateToListOfEntity =
                countryDateToJavaPairEntityRDD.collect();
        log.info("countryDateToListOfEntity is = {} ", countryDateToListOfEntity);

        // Sum the prices within each (country, date) group on the driver.
        for (final Tuple2<String, Iterable<ProductSaleEntity>> item : countryDateToListOfEntity) {
            log.info("country and Date is = {} ", item._1);
            final Iterator<ProductSaleEntity> iterator = item._2.iterator();
            Double totalPrice = 0.0;
            while (iterator.hasNext()) {
                final Double currentPrice = iterator.next().getPrice();
                log.info("Current price is = {} ", currentPrice);
                totalPrice += currentPrice;
            }
            log.info("Total price is = {} ", totalPrice);
        }

        // Map of the count of each unique combination of Country and Date together.
        final Map<String, Long> collectedMapCount = countryDateToListOfEntity.stream()
                .collect(Collectors.toMap(p -> p._1, p -> p._2.spliterator().getExactSizeIfKnown()));
        log.info("collectedMapCount is = {} ", collectedMapCount.toString());
    }
}
Some notes/points about the above code:

- The application runs in local mode; the master is set to local.
- I create the entities productSaleEntity1 through productSaleEntity4 and add them to the list productSaleEntityList.
- The parallelize method converts this Java list into entityJavaRDD, with the numSlices parameter passed as 1.

Sample output for the above input data:
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: countryDateToListOfEntity is = [(Japan::2019-04-17,[ProductSaleEntity{country='Japan', date='2019-04-17', price=5000.0}, ProductSaleEntity{country='Japan', date='2019-04-17', price=3000.0}]), (USA::2019-04-16,[ProductSaleEntity{country='USA', date='2019-04-16', price=10000.0}]), (UK::2019-04-15,[ProductSaleEntity{country='UK', date='2019-04-15', price=4000.0}])]
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = Japan::2019-04-17
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 5000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 3000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 8000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = USA::2019-04-16
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 10000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 10000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: country and Date is = UK::2019-04-15
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Current price is = 4000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: Total price is = 4000.0
21/05/27 17:44:38 INFO MultipleColumnsGroupBy: collectedMapCount is = {Japan::2019-04-17=2, UK::2019-04-15=1, USA::2019-04-16=1}
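As a side note, when the end goal is an aggregate such as this per-group price sum, the same result can be computed without collecting whole groups to the driver by combining mapToPair with reduceByKey, which merges values per key on the executors. A minimal sketch reusing ProductSaleEntity (the class ReduceByKeyAlternative and its method name are just for illustration):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

class ReduceByKeyAlternative {

    // Sums prices per (country, date) without materializing the groups.
    static JavaPairRDD<Tuple2<String, String>, Double> totalPricePerCountryAndDate(
            JavaRDD<ProductSaleEntity> entityJavaRDD) {
        return entityJavaRDD
                // Key each record by a (country, date) tuple; the value is its price.
                .mapToPair(e -> new Tuple2<>(new Tuple2<>(e.getCountry(), e.getDate()), e.getPrice()))
                // Add up the prices of records that share the same composite key.
                .reduceByKey(Double::sum);
    }
}

Calling collect() or collectAsMap() on the returned pair RDD yields the three expected rows, and the per-key counts produced above via Java streams could likewise be obtained with JavaPairRDD's countByKey().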