Custom partitioner for skewed data

Date: 2019-05-25 11:09:48

Tags: scala apache-spark apache-spark-sql

I have worldwide data for a use case. Each country has 3 to 5 products, and a rating is collected from every user every hour. I want to run bootstrapping to compute average ratings and some other statistics per product, per country, per hour.

// Key each record by (country, product, hour); how the fields are read from
// `row` was elided in the original snippet, so names and types are assumptions.
val keyGroup = input.rdd.map { row =>
  val country = row.getAs[String]("country")
  val product = row.getAs[String]("product")
  val hour    = row.getAs[Int]("hour")
  (
    (country, product, hour),
    (country, product, hour, row.getAs[String]("user"), row.getAs[Double]("rating"))
  )
}
val groups = keyGroup.groupByKey()
val output = groups.flatMapValues(x => bootstrap(x)).toDF // needs import spark.implicits._
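
The bootstrap function itself is not shown in the question. A minimal sketch of what such a helper might look like, resampling each group's ratings with replacement and returning the mean of each resample (names, types, and iteration count here are assumptions, not the asker's actual code):

import scala.util.Random

// Hypothetical bootstrap helper: resample the group's ratings with
// replacement `iterations` times and emit the mean of each resample.
def bootstrap(records: Iterable[(String, String, Int, String, Double)],
              iterations: Int = 100): Iterable[Double] = {
  val ratings = records.map(_._5).toArray
  (1 to iterations).map { _ =>
    val resample = Array.fill(ratings.length)(ratings(Random.nextInt(ratings.length)))
    resample.sum / resample.length
  }
}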

The problem is that some countries have huge amounts of data, which makes the whole job run for hours without completing. The partition sizes I get are roughly:

Partition: count -> Countries

0: 2044816 -> India,Turkey

1: 1466790 -> Turkey,India

2: 783772 -> India,Mexico,Japan,South Korea

3: 431538 -> Japan,Mexico,South Korea,India,Indonesia,Turkey,Brazil,Russian Federation

4: 319824 -> South Korea,Brazil,Russian Federation,India,Mexico,United States of America,Turkey,Japan,Bangladesh

5: 268698 -> Bangladesh,Nigeria,Russian Federation,United States of America

6: 264709 -> Russian Federation,United States of America,Germany,Bangladesh,Nigeria,South Africa

7: 227612 -> South Africa,United States of America,Russian Federation,Brazil,South Korea,Germany
...
...
167: 58 -> Mexico,Chile,Uganda,Thailand,Ivory Coast,Antigua and Barbuda,Palau,Luxembourg,United States of America,British Virgin Islands,Iceland,Andorra,Samoa,Vanuatu,Botswana,Saint Lucia,Kiribati,Greenland

168: 69 -> Greenland,Iceland,Chile,Zambia,Estonia,Vanuatu,Cyprus,Malta,Saudi Arabia,Japan,Uruguay,Qatar,United States of America,Luxembourg,Peru,Belize,Papua New Guinea,Samoa,South Sudan

169: 61 -> Myanmar,Belize,Chile,Somalia,Bhutan,Luxembourg,Liberia,Norway,United Kingdom,Burkina Faso,Lithuania,Macedonia,Belgium,Vanuatu,Burundi,DR Congo,Montenegro,Central African Republic,Bosnia and Herzegovina

170: 36 -> Mauritania,Sierra Leone,Hungary,Zambia,Somalia,Federated States of Micronesia,Serbia,Liberia,Nepal,Chile,Israel,Ukraine,Montenegro,Yemen,Croatia,Central African Republic,Armenia,Andorra,United Arab Emirates,Mauritius,Albania,Lebanon,Macedonia

171: 25 -> Spain,Comoros,Libya,Peru,Latvia,Montenegro,Egypt,Malaysia,Central African Republic,Faroe Islands,Tanzania,Palau,Chad,Guatemala,Kiribati,Burundi,Luxembourg,Equatorial Guinea,Barbados,Belgium

172: 14 -> Vietnam,Tanzania,Hungary,Egypt,Comoros,Equatorial Guinea,Guinea-Bissau,Moldova,Macedonia,Guyana,Federated States of Micronesia,New Zealand,Chad

As can be seen, the data is not evenly divided across the 173 partitions. The data is about 6 GB and covers one week. If I run a single country on its own with 1000 partitions, it works, but all countries together do not.
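
(For reference, per-partition counts like the listing above can be produced with a diagnostic along these lines; this is an assumption about how the numbers were gathered, not code from the question:)

// Count the records in each partition of the grouped RDD to diagnose skew
val sizes = groups.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.size))
}.collect()
sizes.sortBy(-_._2).foreach { case (idx, n) => println(s"$idx: $n") }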

I am thinking of writing a custom partitioner, but I don't know how to split the data for the larger countries. It would be great if someone could help.
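
For reference, one deterministic shape such a partitioner could take is sketched below: each known-heavy country gets its own block of partitions, while all other keys hash into the normal range (the heavy-country list and partition counts are illustrative assumptions). Note that a partitioner can only separate distinct (country, product, hour) keys, never split the records of a single hot key, which is why the salting approach in the answer below can go further.

import org.apache.spark.Partitioner

// Sketch: reserve a block of `perHeavy` partitions for each known-heavy
// country; all other keys hash into the first `normalPartitions`.
class SkewAwarePartitioner(normalPartitions: Int, perHeavy: Int,
                           heavyCountries: Seq[String]) extends Partitioner {
  private val heavyIndex = heavyCountries.zipWithIndex.toMap
  private def mod(h: Int, n: Int): Int = ((h % n) + n) % n

  override def numPartitions: Int = normalPartitions + heavyCountries.size * perHeavy

  override def getPartition(key: Any): Int = key match {
    case k @ (country: String, _, _) if heavyIndex.contains(country) =>
      // Heavy country: spread its (product, hour) keys within its own block
      normalPartitions + heavyIndex(country) * perHeavy + mod(k.hashCode, perHeavy)
    case other =>
      mod(other.hashCode, normalPartitions)
  }
}

// Usage with the keyGroup RDD above, e.g.:
// keyGroup.groupByKey(new SkewAwarePartitioner(200, 24, Seq("India", "Turkey")))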

1 answer:

Answer 0 (score: 0)

Before going ahead and writing your own custom partitioner, you can try the following: since you already know which countries skew the data, you can create a composite key for those countries by appending a random number within some range (a larger range for more heavily skewed data). You can aggregate on that key first, then drop the composite key and aggregate again.

df.withColumn("composite_key", 
    when(isSkewDataCountryUDF(col("country")), concat(col("country"), randomNumberSuffix())
    .otherwise(col("country")))
.groupBy("composite_key")
.count
.drop("composite_key")
.groupBy("country")
.count

Also try setting higher values for spark.default.parallelism and spark.sql.shuffle.partitions.
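
For example (values are illustrative and need tuning for the cluster):

// spark.sql.shuffle.partitions (DataFrame/SQL shuffles) can be set at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "1000")
// spark.default.parallelism (RDD shuffles such as groupByKey) must be set
// before the SparkContext is created, e.g. via spark-submit:
//   --conf spark.default.parallelism=1000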