I am trying to write DataFrame contents to Google Cloud Storage using PySpark on Dataproc. The writes succeed, but the logs contain many warning messages like the one pasted below. Is there some setting I am missing when creating the cluster or in the PySpark program, or is this an issue on Google's side?
Note: the data the DataFrame writes to Google Storage is > 120 GB uncompressed, but I noticed the same warnings even when processing only 1 GB of uncompressed data. It is a simple DataFrame with 50 columns: read, a few transformations, then write to disk.
The DataFrame write statement looks like this:
df.write.partitionBy("dt").format('csv').mode("overwrite").options(delimiter="|").save("gs://bucket/tbl/")
Warning message from the PySpark log:
18/04/01 19:58:28 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 182.0 in stage 3.0 (TID 68943, admg-tellrd-w-20.c.syw-analytics-repo-dev.internal, executor 219): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Error closing the output.
at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:861)
at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.close(UnivocityGenerator.scala:86)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.close(CSVFileFormat.scala:141)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.releaseResources(FileFormatWriter.scala:475)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:450)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator.foreach(AbstractScalaRowIterator.scala:26)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:440)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 more
Caused by: java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:432)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
at java.nio.channels.Channels$1.close(Channels.java:178)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.close(GoogleHadoopOutputStream.java:126)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:320)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:857)
... 20 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:358)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Answer 0 (score: 2)
I tried your code, and it was indeed slow: it took over 8 minutes for me.
I got a significant speedup (under 5 minutes) by reading the CSV file with DataFrames instead of RDDs. That avoids shipping all the data back and forth between the JVM and Python.
Answer 1 (score: 1)
(This is not an answer to the question, but it does not fit in a comment. It relates to the post about calling repartition(...) before write.partitionBy.)
Without repartition(...), this will take forever on GCS. Under the hood, when you call write.partitionBy(...), each Spark task writes one file per partition, serially. That is already slow on HDFS, and slower still on GCS because of its higher latency. If creating each file takes 500 ms, writing 2300 partition files takes roughly 20 minutes per task.
If you shuffle the data first, you introduce another ("reduce") stage of tasks, each of which ends up with all the data for its partitions. So instead of writing 2300 * (number of previous-stage tasks) files, you write just 2300. That is what you want, especially when you have many partitions.
You may want to experiment with the number of partitions (i.e. reducer tasks) that repartition(...) creates. By default it is 200, but you probably want to raise it. Each reducer ends up with a subset of the 2300 output partitions and writes its output files serially. Again assuming 500 ms per file, 2300/200 = ~115 files = about one minute per task. With more reducers you get more parallelism, so each task takes less time. However, you should set the number of reducers relative to the number of nodes in your cluster (e.g. 4x the number of vcores).
Also, you may want to bump spark.executor.cores to 4 (--properties spark.executor.cores=4), since this workload will be quite IO-bound.
Answer 2 (score: 0)
This is not an answer to the question, but the code flow for the requirement as it stands.
col1 col2 col3 col4 col5
asd234qsds 2014-01-02 23.99 2014-01-02 Y
2343fsdf55 2014-01-03 22.56 2014-01-03 Y
123fdfr555 2014-01-04 34.23 2014-01-04 N
2343fsdf5f 2014-01-05 45.33 2014-01-05 N
asd234qsds 2014-01-02 27.99 2014-01-07 Y
Please note: the first and last rows have the same key, but during the window function only the latest row is kept. My actual data has 51 columns, and the window function uses 9 of them. I am not sure whether compressing the data adds any overhead to this process.
from pyspark.sql import Row, Window
from pyspark.sql.functions import col
import pyspark.sql.functions as func

# sc and spark are provided by the PySpark shell / Dataproc job context.
lines1 = sc.textFile("gs://incrmental_file.txt*")  # uncompressed data, 210 KB
part1 = lines1.map(lambda l: l.split("|"))
df = part1.map(lambda c: Row(col1=c[0], col2=c[1], col3=c[2], col4=c[3], col5=c[4]))
schema_incr_tbl = spark.createDataFrame(df)
schema_incr_tbl.createOrReplaceTempView("df")
#schema_incr_tbl = spark.sql("""select col1,col2,col3,col4,col5 from df""")

lines2 = sc.textFile("gs://hist_files.gz*")  # full-year compressed data, 38 GiB
part2 = lines2.map(lambda l: l.split("|"))
df2 = part2.map(lambda c: Row(col1=c[0], col2=c[1], col3=c[2], col4=c[3], col5=c[4]))
schema_hist_tbl = spark.createDataFrame(df2)
schema_hist_tbl.createOrReplaceTempView("df2")

union_fn = schema_hist_tbl.union(schema_incr_tbl)

# Keep only the latest row (by col4) per (col1, col2) key.
w = Window.partitionBy("col1", "col2").orderBy(col("col4").desc())
union_result = union_fn.withColumn("row_num",
    func.row_number().over(w)).where(col("row_num") == 1).drop("row_num").drop("col4")
union_result.createOrReplaceTempView("merged_tbl")

schema_merged_tbl = spark.sql("""select col1, col2, col3, col5, col5 as col6 from merged_tbl""")
schema_merged_tbl.write.partitionBy("col6").format("csv").mode("overwrite") \
    .options(delimiter=DELIM, codec="org.apache.hadoop.io.compress.GzipCodec") \
    .save("hdfs_merge_path")