I know the problem of reading large numbers of small files in HDFS has always been an issue and has been widely discussed, but please bear with me. Most stackoverflow questions dealing with this type of problem are about reading a large number of txt files; I'm trying to read a large number of small avro files.
Also, those txt-file solutions talk about using WholeTextFileInputFormat or CombineInputFormat (https://stackoverflow.com/a/43898733/11013878), which are RDD implementations. I'm on Spark 2.4 (HDFS 3.0.0), and RDD implementations are generally discouraged in favour of dataframes. I would prefer to use dataframes, but am open to an RDD implementation as well.
I tried unioning dataframes as suggested by Murtaza, but on a large number of files I got OOM errors (https://stackoverflow.com/a/32117661/11013878).
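For reference, this is roughly what that union attempt looked like; it is a sketch reconstructed from memory, assuming spark is the SparkSession and filePaths is the same Array[String] of avro paths built in the code below, not the exact code I ran:

val perFileDfs = filePaths.map(path => spark.read.format("com.databricks.spark.avro").load(path))
val unionedDf  = perFileDfs.reduce(_ union _) // one union per file; this is where I hit the OOM errors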
Right now I'm using the following code:
val filePaths = avroConsolidator.getFilesInDateRangeWithExtension // filePaths: Array[String]
// I do need to build a list of file paths, as I have to filter files based on their names. This logic is needed for an upstream process.
// example: Array("hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1530.avro","hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1531.avro","hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1532.avro")
val df_mid = sc.read.format("com.databricks.spark.avro").load(filePaths: _*)

val df = df_mid
  .withColumn("dt", date_format(df_mid.col("timeStamp"), "yyyy-MM-dd"))
  .filter("dt != 'null'")

df
  .repartition(partitionColumns(inputs.logSubType).map(new org.apache.spark.sql.Column(_)): _*)
  .write.partitionBy(partitionColumns(inputs.logSubType): _*)
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(avroConsolidator.parquetFilePath.toString)
The avro files are stored in yyyy/mm/dd partitions: hdfs://server123:8020/source/Avro/weblog/2019/06/03
Is there any way to speed up the listing of leaf files? As you can see from the screenshot, it takes only 6 seconds to consolidate the data into parquet files, but 1.3 minutes just to list the files.
Answer 0 (score: 1)
Since reading large numbers of small files was taking too long, I took a step back and created RDDs using CombineFileInputFormat. This input format works well with small files because it packs many of them into one split, so there are fewer mappers and each mapper has more data to process.
Here is what I did:
import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

def createDataFrame(filePaths: Array[Path], sc: SparkSession, inputs: AvroConsolidatorInputs): DataFrame = {
  val job: Job = Job.getInstance(sc.sparkContext.hadoopConfiguration)
  FileInputFormat.setInputPaths(job, filePaths: _*)
  // getSchema and genericRecordToRow are helpers from my own code base
  val sqlType = SchemaConverters.toSqlType(getSchema(inputs.logSubType))
  val rddKV = sc.sparkContext.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[CombinedAvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  val rowRDD = rddKV.mapPartitions(
    f = (iter: Iterator[(AvroKey[GenericRecord], NullWritable)]) =>
      iter.map(_._1.datum()).map(genericRecordToRow(_, sqlType)),
    preservesPartitioning = true)
  val df = sc.sqlContext.createDataFrame(rowRDD,
    sqlType.dataType.asInstanceOf[StructType])
  df
}
CombinedAvroKeyInputFormat is a user-defined class that extends CombineFileInputFormat and packs 64 MB of data into a single split.
import java.io.IOException

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyRecordReader}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{CombineFileInputFormat, CombineFileRecordReader, CombineFileSplit, FileSplit}
import org.apache.log4j.Logger

object CombinedAvroKeyInputFormat {

  class CombinedAvroKeyRecordReader[T](var inputSplit: CombineFileSplit, context: TaskAttemptContext, idx: Integer)
    extends AvroKeyRecordReader[T](AvroJob.getInputKeySchema(context.getConfiguration)) {

    @throws[IOException]
    @throws[InterruptedException]
    override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
      this.inputSplit = inputSplit.asInstanceOf[CombineFileSplit]
      val fileSplit = new FileSplit(this.inputSplit.getPath(idx),
                                    this.inputSplit.getOffset(idx),
                                    this.inputSplit.getLength(idx),
                                    this.inputSplit.getLocations)
      super.initialize(fileSplit, context)
    }
  }
}

/*
 * CombineFileInputFormat is an abstract class with no concrete implementation, so we must create a subclass to use it;
 * we'll name the subclass CombinedAvroKeyInputFormat. The subclass creates a delegate CombinedAvroKeyRecordReader that extends AvroKeyRecordReader.
 */
class CombinedAvroKeyInputFormat[T] extends CombineFileInputFormat[AvroKey[T], NullWritable] {

  val logger = Logger.getLogger(AvroConsolidator.getClass)
  setMaxSplitSize(67108864) // 64 MB per split

  def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[AvroKey[T], NullWritable] = {
    val c = classOf[CombinedAvroKeyInputFormat.CombinedAvroKeyRecordReader[_]]
    val inputSplit = split.asInstanceOf[CombineFileSplit]

    /*
     * CombineFileRecordReader is a built-in class that passes each split to our CombinedAvroKeyRecordReader.
     * When the hadoop job starts, CombineFileRecordReader reads the sizes of all the files in HDFS that we want it to process
     * and decides how many splits to create based on the MaxSplitSize.
     */
    new CombineFileRecordReader[AvroKey[T], NullWritable](
      inputSplit,
      context,
      c.asInstanceOf[Class[_ <: RecordReader[AvroKey[T], NullWritable]]])
  }
}
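In case it helps, this is roughly how the pieces are wired together from the driver side. It is a sketch rather than a copy of my job: the app name is a placeholder, while filePaths, inputs, partitionColumns and the parquet output path are the same names as in my question.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("AvroConsolidator").getOrCreate() // app name is illustrative
val paths: Array[Path] = filePaths.map(new Path(_)) // filePaths: Array[String] from the question
val consolidatedDf = createDataFrame(paths, spark, inputs)

consolidatedDf.write
  .partitionBy(partitionColumns(inputs.logSubType): _*)
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(avroConsolidator.parquetFilePath.toString)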
This made reading the small files a great deal faster.
Answer 1 (score: 0)
I had a similar problem reading 100s of small avro files from AWS S3 with:
spark.read.format("avro").load(<file_directory_path_containing_many_avro_files>)
The job would hang at various points after completing most of the scheduled tasks. For example, it would quickly complete 110 of 111 tasks in 25 seconds and then hang at the 110th, and on the next attempt it would hang at task 98 of 111. It never got past the point where it hung.
After reading about a similar problem here: https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/ (which in turn references the spark configuration guide), I found that the spark configuration below, while not a fix for the root cause of the hanging, proved to be a quick fix and workaround.
Setting spark.speculation to true resolved the issue.
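For reference, a minimal sketch of one way to set it via the standard SparkSession config mechanism; the same flag can also be passed on the command line as --conf spark.speculation=true, and the app name is just a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("small-avro-files")         // placeholder app name
  .config("spark.speculation", "true") // speculatively re-launch slow (straggling) tasks
  .getOrCreate()

val df = spark.read.format("avro").load(<file_directory_path_containing_many_avro_files>) // same placeholder path as above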