When I do an orderBy on a PySpark dataframe, does it sort the data across all partitions (i.e. the entire result), or does it sort at the partition level? If it is the latter, can anyone suggest how to perform an orderBy across the whole dataset? I do have an orderBy right at the end.
My current code:
When I do df.explain(), I get the following -
def extract_work(self, days_to_extract):
    source_folders = self.work_folder_provider.get_work_folders(s3_source_folder=self.work_source,
                                                                 warehouse_ids=self.warehouse_ids,
                                                                 days_to_extract=days_to_extract)
    source_df = self._load_from_s3(source_folders)

    # Partition and de-dupe the data-frame retaining latest
    source_df = self.data_frame_manager.partition_and_dedupe_data_frame(source_df,
                                                                        partition_columns=['binScannableId', 'warehouseId'],
                                                                        sort_key='cameraCaptureTimestampUtc',
                                                                        desc=True)

    # Filter out anything that does not qualify for virtual count.
    source_df = self._virtual_count_filter(source_df)

    history_folders = self.work_folder_provider.get_history_folders(s3_history_folder=self.history_source,
                                                                    days_to_extract=days_to_extract)
    history_df = self._load_from_s3(history_folders)

    # Filter out historical items
    if history_df:
        source_df = source_df.join(history_df, 'binScannableId', 'leftanti')
    else:
        self.logger.error("No History was found")

    # Sort by defectProbability
    source_df = source_df.orderBy(desc('defectProbability'))
    return source_df

def partition_and_dedupe_data_frame(data_frame, partition_columns, sort_key, desc):
    if desc:
        window = Window.partitionBy(partition_columns).orderBy(F.desc(sort_key))
    else:
        window = Window.partitionBy(partition_columns).orderBy(F.asc(sort_key))

    data_frame = data_frame.withColumn('rank', F.rank().over(window)).filter(F.col('rank') == 1).drop('rank')
    return data_frame

def _virtual_count_filter(self, source_df):
    df = self._create_data_frame()
    for key in self.virtual_count_thresholds.keys():
        temp_df = source_df.filter((source_df['expectedQuantity'] == key) & (source_df['defectProbability'] > self.virtual_count_thresholds[key]))
        df = df.union(temp_df)
    return df
Answer 0 (score: 2)
orderBy() is a "wide transformation", which means Spark needs to trigger a "shuffle" and a "stage split" (one input partition to many output partitions), and therefore has to pull in all of the partition splits distributed across the cluster to perform the orderBy() here.
If you look at the explain plan, it has a repartitioning indicator with the default 200 output partitions (the spark.sql.shuffle.partitions configuration), which are written to disk when the job executes. This tells you that a "wide transformation", a.k.a. a "shuffle", happens when a Spark "action" is executed.
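As a hedged aside (assuming an active SparkSession bound to the name spark, as in the example further down), you can inspect and adjust that setting before triggering the shuffle, and the Exchange in the plan will then use the new value instead of 200:

# Inspect the current shuffle parallelism; defaults to '200'
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it so subsequent shuffles produce 8 output partitions instead
spark.conf.set("spark.sql.shuffle.partitions", "8")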
Other "wide transformations" include: distinct(), groupBy(), and join() => *sometimes*.
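For instance, a plain groupBy() typically shows the same kind of Exchange step in its plan; this is only a sketch, again assuming the spark session used in the example below:

# groupBy() is also a wide transformation: the physical plan contains an
# Exchange hashpartitioning(...) between the partial and final aggregations
spark.range(10).groupBy("id").count().explain()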
from pyspark.sql.functions import desc
df = spark.range(10).orderBy(desc("id"))
df.show()
df.explain()
+---+
| id|
+---+
| 9|
| 8|
| 7|
| 6|
| 5|
| 4|
| 3|
| 2|
| 1|
| 0|
+---+
== Physical Plan ==
*(2) Sort [id#6L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(id#6L DESC NULLS LAST, 200)
+- *(1) Range (0, 10, step=1, splits=8)
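As an aside, if only a per-partition ordering is needed, sortWithinPartitions() sorts inside each partition without triggering an Exchange; a minimal sketch under the same assumptions (the spark session and the desc import from above):

# Partition-level sort only: the plan shows Sort [...], false, 0
# (the 'false' marks a non-global sort) and no Exchange step
df_local = spark.range(10).sortWithinPartitions(desc("id"))
df_local.explain()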