Is there a way to run operations in parallel on a partitioned Spark dataset?

Asked: 2019-07-01 22:14:26

Tags: algorithm scala apache-spark dataset

I have a list of datasets that I want to partition by a specific key common to all of them, and then run the same joins/group-bys on all of the partitioned datasets.

I am trying to design the algorithm so that I use Spark's partitionBy to create the partitions by that specific key.

Now, one approach is to run the operations on each partition in a loop, but that is not efficient.

I want to find out whether, once the data is partitioned manually, I can run the operations on those datasets in parallel.

I have only just started learning Spark, so please bear with me.

Consider customer IDs and their behavioral data (browses/clicks, etc.) spread across different datasets, say one dataset for browses and another for clicks. First I am thinking of partitioning the data by customer ID, and then for each partition (customer) joining on some attribute such as browser or device to see each customer's behavior. So basically it is like nested parallelization.
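
A rough sketch of what I have in mind is below (the paths, dataset names and columns are made up for illustration; I am not sure this is the right approach):

// Hypothetical illustration only; paths and column names are assumptions
import spark.implicits._

val browses = spark.read.parquet("/data/browses") // customer_id, device, url, ...
val clicks  = spark.read.parquet("/data/clicks")  // customer_id, device, target, ...

// partition both datasets by the common key, then join/aggregate per customer
val behavior = browses.repartition($"customer_id")
  .join(clicks.repartition($"customer_id"), Seq("customer_id", "device"))
  .groupBy("customer_id", "device")
  .count()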

Is this even possible in Spark? Is there something obvious I am missing? Is there any documentation I can refer to?

1 Answer:

Answer 0 (score: 0)

Try this -

1. Create a test dataset (total records = 70000+) to perform a parallel operation on each partition.

scala> ds.count
res137: Long = 70008

scala> ds.columns
res124: Array[String] = Array(awards, country)
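
(For reference, a minimal sketch of how a comparable test dataset could be built; the values below are assumptions, not the actual data behind these numbers:)

import spark.implicits._

// hypothetical test data: ~70000 rows spread across seven countries
val countryList = Seq("CANADA", "CHINA", "USA", "EUROPE", "UK", "RUSSIA", "INDIA")
val ds = (0 until 70008)
  .map(i => (s"award_$i", countryList(i % countryList.length)))
  .toDF("awards", "country")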

2. Assume the partition column is "country".

scala> ds.select("country").distinct.show(false)
+-------+
|country|
+-------+
|CANADA |
|CHINA  |
|USA    |
|EUROPE |
|UK     |
|RUSSIA |
|INDIA  |
+-------+

3. Get the record count for each country [**without parallel processing of each partition**]

scala> val countries = ds.select("country").distinct.collect
countries: Array[org.apache.spark.sql.Row] = Array([CANADA], [CHINA], [USA], [EUROPE], [UK], [RUSSIA], [INDIA])

scala> val startTime = System.currentTimeMillis()
startTime: Long = 1562047887130

scala> countries.foreach(country => ds.filter(ds("country") === country(0)).groupBy("country").count.show(false))
+-------+-----+
|country|count|
+-------+-----+
|CANADA |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|CHINA  |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|USA    |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|EUROPE |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|UK     |10002|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|RUSSIA |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|INDIA  |10001|
+-------+-----+


scala> val endTime = System.currentTimeMillis()
endTime: Long = 1562047896088

scala> println(s"Total Execution Time :  ${(endTime - startTime) / 1000} Seconds")
Total Execution Time :  **8 Seconds**

4. Get the record count for each country [**with parallel processing of each partition**]

scala> val startTime = System.currentTimeMillis()
startTime: Long = 1562048057431

scala> countries.par.foreach(country => ds.filter(ds("country") === country(0)).groupBy("country").count.show(false))

+-------+-----+
|country|count|
+-------+-----+
|INDIA  |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|CANADA |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|RUSSIA |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|USA    |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|UK     |10002|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|CHINA  |10001|
+-------+-----+

+-------+-----+
|country|count|
+-------+-----+
|EUROPE |10001|
+-------+-----+


scala> val endTime = System.currentTimeMillis()
endTime: Long = 1562048060273

scala> println(s"Total Execution Time :  ${(endTime - startTime) / 1000} Seconds")
Total Execution Time :  **2 Seconds**

Result:-

With    parallel processing of each partition, it took ~ **2 Seconds**
Without parallel processing of each partition, it took ~ **8 Seconds**
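
Note: countries.par turns the array into a Scala parallel collection, so each iteration submits its Spark job from a separate thread, and Spark runs those jobs concurrently. On Scala 2.13+ the parallel collections live in a separate module (scala-parallel-collections); a minimal alternative using Futures, assuming the same ds and countries as above, would be:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// submit one Spark job per country from its own thread, then wait for all of them
val jobs = countries.toSeq.map { country =>
  Future {
    ds.filter(ds("country") === country(0)).groupBy("country").count.show(false)
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)

With the default FIFO scheduler the concurrent jobs still queue for executor resources; setting spark.scheduler.mode=FAIR lets them share the cluster more evenly.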

I tested this by checking the record count per country, but you can run any process on each partition, such as writing to a Hive table or an HDFS file, etc.

Hope this helps.