Goal: I need to extract millions of rows from Cassandra and compress them into a single file as quickly as possible (daily).
The current setup uses a Google Dataproc cluster to run a Spark job that extracts the data directly into a Google Cloud Storage bucket. I have tried two approaches:
1. Use the (now deprecated) FileUtil.copyMerge() to combine the roughly 9,000 Spark partition files into a single uncompressed file, then submit a Hadoop MapReduce job to compress that single file (see the sketch after this list).
2. Leave the roughly 9,000 Spark partition files as the raw output and submit a Hadoop MapReduce job to merge and compress those files into a single file.
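For context, approach 1 typically looks roughly like the sketch below. This is a minimal illustration only, assuming a Hadoop 2.x client (where copyMerge still exists, though deprecated); the paths are placeholders, not the actual buckets from this job.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Concatenate the part-* files written by the Spark job into one uncompressed file.
// Deprecated in Hadoop 2.x and removed in 3.x; paths below are placeholders.
val conf = new Configuration()
val srcDir  = new Path("gs://bucket/spark/output/")      // directory of part-* files
val dstFile = new Path("gs://bucket/merged/output.csv")  // single merged output file
val srcFs = srcDir.getFileSystem(conf)
val dstFs = dstFile.getFileSystem(conf)
FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, /* deleteSource = */ false, conf, null)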
Some job details:
Roughly 800 million rows.
The Spark job writes roughly 9,000 Spark partition files.
The Spark job takes about an hour to complete on a 1-master, 4-worker (4 vCPUs, 15 GB memory each) Dataproc cluster.
The default Dataproc Hadoop block size, which I believe is 128 MB.
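For reference, that block size is consistent with the counters further down: roughly 47 GB of input (47,132,294,405 bytes read from GS) divided by 128 MB splits (134,217,728 bytes) works out to about 351 splits, which matches the ~353 launched map tasks.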
Some Spark configuration details:
spark.task.maxFailures=10
spark.executor.cores=4
spark.cassandra.input.consistency.level=LOCAL_ONE
spark.cassandra.input.reads_per_sec=100
spark.cassandra.input.fetch.size_in_rows=1000
spark.cassandra.input.split.size_in_mb=64
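These settings are applied in the usual way; a minimal sketch is shown below, assuming a Scala job using the spark-cassandra-connector's DataFrame source. The app name, contact point, keyspace, and table are placeholders, not details from the actual job.

import org.apache.spark.sql.SparkSession

// Build a session with the Cassandra read settings listed above.
val spark = SparkSession.builder()
  .appName("cassandra-extract")                              // placeholder app name
  .config("spark.task.maxFailures", "10")
  .config("spark.executor.cores", "4")
  .config("spark.cassandra.connection.host", "10.0.0.1")     // placeholder contact point
  .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
  .config("spark.cassandra.input.reads_per_sec", "100")
  .config("spark.cassandra.input.fetch.size_in_rows", "1000")
  .config("spark.cassandra.input.split.size_in_mb", "64")
  .getOrCreate()

// Read the table as a DataFrame via the connector's data source.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholders
  .load()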
The Hadoop job:
hadoop jar file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dstream.map.output.field.separator=, \
  -Dmapred.textoutputformat.separator=, \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input gs://bucket/with/either/single/uncompressed/csv/or/many/spark/partition/file/csvs \
  -output gs://output/bucket \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
The job's INFO output:
INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=5072098452
FILE: Number of bytes written=7896333915
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=47132294405
GS: Number of bytes written=2641672054
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=57024
HDFS: Number of bytes written=0
HDFS: Number of read operations=352
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Killed map tasks=1
Launched map tasks=353
Launched reduce tasks=1
Rack-local map tasks=353
Total time spent by all maps in occupied slots (ms)=18495825
Total time spent by all reduces in occupied slots (ms)=7412208
Total time spent by all map tasks (ms)=6165275
Total time spent by all reduce tasks (ms)=2470736
Total vcore-milliseconds taken by all map tasks=6165275
Total vcore-milliseconds taken by all reduce tasks=2470736
Total megabyte-milliseconds taken by all map tasks=18939724800
Total megabyte-milliseconds taken by all reduce tasks=7590100992
Map-Reduce Framework
Map input records=775533855
Map output records=775533855
Map output bytes=47130856709
Map output materialized bytes=2765069653
Input split bytes=57024
Combine input records=0
Combine output records=0
Reduce input groups=2539721
Reduce shuffle bytes=2765069653
Reduce input records=775533855
Reduce output records=775533855
Spilled Records=2204752220
Shuffled Maps =352
Failed Shuffles=0
Merged Map outputs=352
GC time elapsed (ms)=87201
CPU time spent (ms)=7599340
Physical memory (bytes) snapshot=204676702208
Virtual memory (bytes) snapshot=1552881852416
Total committed heap usage (bytes)=193017675776
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=47132294405
File Output Format Counters
Bytes Written=2641672054
I have full control over both the Spark job and the Hadoop job. I know we could create a larger cluster, but I would rather do that only after making sure the jobs themselves are optimized. Any help is appreciated. Thanks.
Answer (score: 1)
Could you provide more details about your Spark job? Which Spark API are you using: RDD or DataFrame? Why not perform the merge phase entirely in Spark (using repartition().write()) and avoid chaining the Spark and MR jobs together?
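For example, a minimal sketch of that suggestion, assuming the rows are already loaded into a DataFrame df via the Cassandra connector (the output path is a placeholder):

// Collapse to a single partition and write one gzip-compressed CSV file.
// repartition(1) forces a full shuffle; coalesce(1) avoids the shuffle, but
// either way the final write funnels through a single task.
df.repartition(1)
  .write
  .option("compression", "gzip")
  .csv("gs://output/bucket/merged")   // placeholder output path

The result is still a directory containing a single part-*.csv.gz file, which can be renamed or moved afterwards if a fixed file name is required.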