I have the following code in my Spark application:
val log_file_df = loadInputFile()                            /* loads log file as a DataFrame */
val split_df: DataFrame = splitLineByDelimiter(log_file_df)  /* applies filter function to input file */
val bad_data_df: DataFrame = parseAndSaveBadData(split_df)   /* filters split dataframe into bad data */
val good_data_df = split_df.except(bad_data_df)              /* separates good data from bad data */
If I perform any action such as show() on split_df or bad_data_df, it completes in less time (around 1.5 minutes), and I checked the physical plan: the input log file is read only once.
But if I perform any action on the good data, it takes much more time (4 minutes):

val good_data_df = split_df.except(bad_data_df).show()

From the physical plan, the input log file is read twice. I tried the following options:
split_df.cache() or split_df.createOrReplaceTempView("split_dfTable")
// Init.getSparkSession.sqlContext.cacheTable("split_dfTable")
val splitbc = Init.getSparkSession.sparkContext.broadcast(split_df)
but the execution time did not improve and the physical plan stayed the same. Here is the actual plan. How should I improve my code? good_data_df undergoes further transformations and is joined with some other DataFrames, which takes even more time.

good_data_df.show(false)
good_data_df.explain(true)
+- Exchange hashpartitioning(hostname#16, date#17, path#18, status#19,
content_size#20, 200)
+- *HashAggregate(keys=[hostname#16, date#17, path#18, status#19,
content_size#20], functions=[], output=[hostname#16, date#17, path#18,
status#19, content_size#20])
+- SortMergeJoin [coalesce(hostname#16, ), coalesce(date#17, ),
coalesce(path#18, ), coalesce(status#19, ),
coalesce(content_size#20, )],
[coalesce(hostname#49, ), coalesce(date#50, ), coalesce(path#51, ),
coalesce(status#52, ), coalesce(content_size#53, )], LeftAnti,
(((((hostname#16 <=> hostname#49) && (date#17 <=> date#50)) && (path#18 <=>
path#51)) && (status#19 <=> status#52)) && (content_size#20 <=>
content_size#53))
:- *Sort [coalesce(hostname#16, ) ASC NULLS FIRST, coalesce(date#17, ) ASC
NULLS FIRST, coalesce(path#18, ) ASC NULLS FIRST, coalesce(status#19, ) ASC
NULLS FIRST, coalesce(content_size#20, ) ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(coalesce(hostname#16, ), coalesce(date#17, ),
coalesce(path#18, ), coalesce(status#19, ), coalesce(content_size#20, ), 200)
: +- *Project [regexp_extract(val#13, ^([^\s]+\s), 1) AS hostname#16,
regexp_extract(val#13, ^.*(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1) AS
date#17, regexp_extract(val#13, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1) AS
path#18, regexp_extract(val#13, ^.*"\s+([^\s]+), 1) AS status#19,
regexp_extract(val#13, ^.*\s+(\d+)$, 1) AS content_size#20]
: +- *FileScan csv [val#13] Batched: false, Format: CSV, Location:
InMemoryFileIndex[file:/C:/Users/M1047320/Desktop/access_log_Jul95],
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<val:string>
+- *Sort [coalesce(hostname#49, ) ASC NULLS FIRST, coalesce(date#50, ) ASC
NULLS FIRST, coalesce(path#51, ) ASC NULLS FIRST, coalesce(status#52, ) ASC
NULLS FIRST, coalesce(content_size#53, ) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(coalesce(hostname#49, ),
coalesce(date#50, ), coalesce(path#51, ), coalesce(status#52, ),
coalesce(content_size#53, ), 200)
+- *Project [regexp_extract(val#13, ^([^\s]+\s), 1) AS hostname#49,
regexp_extract(val#13, ^.*(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1) AS
date#50, regexp_extract(val#13, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1) AS
path#51, regexp_extract(val#13, ^.*"\s+([^\s]+), 1) AS status#52,
regexp_extract(val#13, ^.*\s+(\d+)$, 1) AS content_size#53]
+- *Filter ((((regexp_extract(val#13, ^.*"\w+\s+([^\s]+)\s*
[(HTTP)]*.*", 1) RLIKE .*(jpg|gif|png|xbm|jpeg|wav|mpg|pl)$ ||
(regexp_extract(val#13, ^([^\s]+\s), 1) = )) || (regexp_extract(val#13,
^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1) = )) || (regexp_extract(val#13, ^.*
(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1) = )) ||
(regexp_extract(val#13, ^.*"\s+([^\s]+), 1) = ))
+- *FileScan csv [val#13] Batched: false, Format: CSV,
Location:
InMemoryFileIndex[file:/C:/Users/M1047320/Desktop/access_log_Jul95],
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<val:string>
Answer 0 (score: 1)
The reason show on split_df and bad_data_df takes significantly less time is that Spark reads and parses only the rows needed for that show. The data you read into Spark is split into partitions, the chunks of data that are divided among the workers.
Once show is called on split_df or bad_data_df, Spark only has to work on a small portion of the data (just 20 rows for split_df, just the first 20 bad rows for bad_data_df).
When show is called on good_data_df, on the other hand, Spark has to process all of the data: read it all, parse it all, and run the except to subtract the bad rows from the total.
If you have a simple way of defining the bad rows, I suggest using a UDF to add a boolean isBad column and filtering on it. A single pass over the data is much simpler than except.
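A minimal sketch of that idea, assuming the column names from the physical plan (hostname, path, status) and a hypothetical isBadLine predicate that you would replace with the actual rules used by parseAndSaveBadData:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical predicate mirroring the kind of checks seen in the plan's Filter
// (empty extracted fields, or static-resource paths). Assumes the fields are
// empty strings rather than null, as regexp_extract returns "" on no match.
val isBadLine = udf { (hostname: String, path: String, status: String) =>
  hostname.isEmpty || path.isEmpty || status.isEmpty ||
    path.matches(""".*(jpg|gif|png|xbm|jpeg|wav|mpg|pl)$""")
}

// Tag each row once, then split with two cheap filters instead of an except (anti-join).
val tagged_df: DataFrame = split_df.withColumn(
  "isBad", isBadLine(col("hostname"), col("path"), col("status")))

val bad_data_df  = tagged_df.filter(col("isBad")).drop("isBad")
val good_data_df = tagged_df.filter(!col("isBad")).drop("isBad")

With split_df cached, both filters scan the input once instead of re-reading and re-parsing the file for each side of the anti-join.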
Answer 1 (score: 0)
Cache is not an action. So if you only call split_df.cache() before good_data_df.show(), the DAG is created only for good_data_df.show, not for split_df.cache. The caching of split_df would be executed as a stage, but good_data_df would not be able to use that cache. To make good_data_df use the cached data, just call split_df.take(1) after split_df.cache(); this actually materializes split_df in the cache so that good_data_df can use it.
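A minimal sketch of that ordering, reusing the helper names from the question (loadInputFile, splitLineByDelimiter, parseAndSaveBadData are assumed to be the same functions shown there):

val log_file_df = loadInputFile()
val split_df: DataFrame = splitLineByDelimiter(log_file_df)

// Mark split_df for caching (lazy), then force materialization with a small action
// so later jobs read from the cache instead of re-scanning the log file.
split_df.cache()
split_df.take(1)

val bad_data_df: DataFrame = parseAndSaveBadData(split_df)
val good_data_df = split_df.except(bad_data_df)

// Both branches of the anti-join behind except can now reuse the cached split_df.
good_data_df.show(false)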