PySpark dataframe orderBy: partition level or overall?

Asked: 2019-04-26 03:22:13

Tags: python apache-spark pyspark pyspark-sql

When I do an orderBy on a PySpark dataframe, does it sort the data across all partitions (i.e., the entire result), or does the sorting happen at the partition level? If it is the latter, can anyone suggest how to do an orderBy across all of the data? I have an orderBy right at the end.

My current code:

def extract_work(self, days_to_extract):

        source_folders = self.work_folder_provider.get_work_folders(s3_source_folder=self.work_source,
                                                                    warehouse_ids=self.warehouse_ids,
                                                                    days_to_extract=days_to_extract)
        source_df = self._load_from_s3(source_folders)

        # Partition and de-dupe the data-frame retaining latest
        source_df = self.data_frame_manager.partition_and_dedupe_data_frame(source_df,
                                                                            partition_columns=['binScannableId', 'warehouseId'],
                                                                            sort_key='cameraCaptureTimestampUtc',
                                                                            desc=True)
        # Filter out anything that does not qualify for virtual count.
        source_df = self._virtual_count_filter(source_df)

        history_folders = self.work_folder_provider.get_history_folders(s3_history_folder=self.history_source,
                                                                        days_to_extract=days_to_extract)
        history_df = self._load_from_s3(history_folders)

        # Filter out historical items
        if history_df:
            source_df = source_df.join(history_df, 'binScannableId', 'leftanti')
        else:
            self.logger.error("No History was found")

        # Sort by defectProbability
        source_df = source_df.orderBy(desc('defectProbability'))

        return source_df

def partition_and_dedupe_data_frame(data_frame, partition_columns, sort_key, desc): 
          if desc: 
            window = Window.partitionBy(partition_columns).orderBy(F.desc(sort_key)) 
          else: 
            window = Window.partitionBy(partition_columns).orderBy(F.asc(sort_key)) 

          data_frame = data_frame.withColumn('rank', F.rank().over(window)).filter(F.col('rank') == 1).drop('rank') 
          return data_frame

def _virtual_count_filter(self, source_df):
        df = self._create_data_frame()
        for key in self.virtual_count_thresholds.keys():
            temp_df = source_df.filter((source_df['expectedQuantity'] == key) & (source_df['defectProbability'] > self.virtual_count_thresholds[key]))
            df = df.union(temp_df)
        return df

When I do df.explain(), I get the following -

1 Answer:

Answer 0 (score: 2)

orderBy() is a "wide transformation", which means Spark needs to trigger a "shuffle" and a "stage split (1 partition to many output partitions)", retrieving all of the partition splits distributed across the cluster in order to perform the orderBy() here.

If you look at the explain plan, it has a re-partitioning indicator with the default 200 output partitions (the spark.sql.shuffle.partitions configuration) that the shuffle output is written out with when it executes. This tells you that a "wide transformation", a.k.a. a "shuffle", will occur when a Spark "action" is executed.
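As a small illustration (a sketch, assuming an active SparkSession named spark; the exact plan text varies by Spark version), lowering that setting changes the partition count reported by the Exchange:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shuffles triggered after this point use 8 output partitions instead of the default 200
spark.conf.set("spark.sql.shuffle.partitions", "8")

# The physical plan now shows something like:
#   Exchange rangepartitioning(id ASC NULLS FIRST, 8)
spark.range(10).orderBy("id").explain()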

Other "wide transformations" include: distinct(), groupBy(), and join() => *sometimes*
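For example, here is a minimal sketch (again assuming an active SparkSession named spark) showing that groupBy() likewise appears as an Exchange (hashpartitioning) in the physical plan:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Aggregation needs rows with the same key co-located, so Spark inserts a
# hash-partitioned shuffle; explain() shows an Exchange hashpartitioning step
grouped = spark.range(100).groupBy((F.col("id") % 3).alias("bucket")).count()
grouped.explain()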

from pyspark.sql.functions import desc
df = spark.range(10).orderBy(desc("id"))
df.show()
df.explain()

+---+
| id|
+---+
|  9|
|  8|
|  7|
|  6|
|  5|
|  4|
|  3|
|  2|
|  1|
|  0|
+---+

== Physical Plan ==
*(2) Sort [id#6L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(id#6L DESC NULLS LAST, 200)
   +- *(1) Range (0, 10, step=1, splits=8)
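To tie this back to the original question: orderBy() produces a total ordering across all partitions (it shuffles first), whereas sortWithinPartitions() only sorts each partition independently and does not shuffle. A minimal sketch of the contrast (assuming an active SparkSession named spark; plan details vary by Spark version):

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).repartition(4)

# Global sort: the plan contains an Exchange rangepartitioning step (a shuffle)
df.orderBy(desc("id")).explain()

# Partition-level sort: the plan has a Sort but no Exchange above it,
# so each of the 4 partitions is sorted on its own
df.sortWithinPartitions(desc("id")).explain()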