I am trying to deeply understand the textFile method, but I think my lack of Hadoop knowledge is holding me back here. Let me lay out my understanding, and maybe you can correct anything that is incorrect.

When sc.textFile(path) is called, defaultMinPartitions is used, which is really just math.min(taskScheduler.defaultParallelism, 2). Let's assume we are using SparkDeploySchedulerBackend, where that is conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)).

So now let's say the default is 2. Going back to textFile, this is passed in to HadoopRDD. The true size is determined in getPartitions() using inputFormat.getSplits(jobConf, minPartitions). However, from what I can find, the number of partitions is merely a hint and is in fact mostly ignored, so you will probably just end up with the total number of blocks.

OK, this fits with expectations, but what if the default is not used and you provide a partition size that is larger than the block size? If my research is right and the getSplits call simply ignores this parameter, wouldn't the provided minimum end up being ignored and you would still just get the block size?
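For concreteness, this is a small, self-contained snippet I would use to observe the behaviour; the input path and the 4-core local master are just assumptions for the example:

import org.apache.spark.{SparkConf, SparkContext}

object MinPartitionsCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("min-partitions-check").setMaster("local[4]"))

    // No hint: textFile falls back to sc.defaultMinPartitions,
    // i.e. math.min(defaultParallelism, 2).
    val byDefault = sc.textFile("data/sample.txt")
    println(s"defaultMinPartitions = ${sc.defaultMinPartitions}, " +
            s"partitions = ${byDefault.getNumPartitions}")

    // Explicit hint: forwarded to HadoopRDD and then to
    // inputFormat.getSplits(jobConf, minPartitions), which treats it only as a hint.
    val withHint = sc.textFile("data/sample.txt", 12)
    println(s"requested 12, got ${withHint.getNumPartitions} partitions")

    sc.stop()
  }
}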
Answer 0 (score: 2)
Short version:

The split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize; if it is larger than HDFS's blockSize, multiple blocks within the same file are combined into a single split.
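As a sketch of how this could be exercised from Spark (not part of the original answer; the path is hypothetical, and both the old mapred.* and new mapreduce.* property names are set because which one applies depends on the InputFormat API in use):

import org.apache.spark.{SparkConf, SparkContext}

// Ask for splits of at least 256 MB, so that several 128 MB blocks of the
// same file end up combined into one split.
val sc = new SparkContext(
  new SparkConf().setAppName("split-size-demo").setMaster("local[4]"))

val minSplit = 256L * 1024 * 1024
sc.hadoopConfiguration.setLong("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", minSplit)

val rdd = sc.textFile("hdfs:///path/to/large-file")  // hypothetical input
println(s"partitions = ${rdd.getNumPartitions}")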
Detailed version:

I think your understanding of the procedure up to inputFormat.getSplits is correct.

Inside inputFormat.getSplits, more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that finally determines the split size. (I'm not sure which one takes effect for Spark; I'm inclined to believe the former.)

Let's look at the code: FileInputFormat from Hadoop 2.4.0

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    FileSystem fs = path.getFileSystem(job);
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts));
      }
    } else {
      String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
      splits.add(makeSplit(path, 0, length, splitHosts));
    }
  } else {
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}
In the for loop, makeSplit() is used to generate each split, and splitSize is the effective split size. The computeSplitSize function is used to compute splitSize:
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSplitSize > blockSize, each output split is actually a combination of several blocks of the same HDFS file; on the other hand, if minSplitSize < blockSize, each split corresponds to exactly one HDFS block.
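To make this concrete, here is a small stand-alone Scala sketch of the same formula with made-up numbers (a 128 MB block size and a 1 GB file split into 2 requested partitions), showing both cases:

// Stand-alone re-implementation of computeSplitSize, for illustration only.
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

val blockSize = 128L * 1024 * 1024        // 128 MB HDFS block
val goalSize  = 1024L * 1024 * 1024 / 2   // 1 GB file / 2 requested partitions = 512 MB

// minSplitSize > blockSize: 256 MB splits, i.e. two blocks combined per split
println(computeSplitSize(goalSize, 256L * 1024 * 1024, blockSize))  // 268435456

// minSplitSize < blockSize: the split size equals the block size, one block per split
println(computeSplitSize(goalSize, 1L, blockSize))                  // 134217728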
Answer 1 (score: 0)
I will add more points with examples to Yijie Shen's answer.

Before we go into the details, let's understand the following. Assume that we are working on a Spark standalone local system with 4 cores, and that in the application the master is configured as new SparkConf().setMaster("local[*]"). Then:

defaultParallelism : 4 (taskScheduler.defaultParallelism, i.e. the number of cores)
/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */

defaultMinPartitions : 2 // Default min number of partitions for Hadoop RDDs when not given by user
/* Notice that we use math.min, so defaultMinPartitions cannot be higher than 2. */

The logic for finding defaultMinPartitions is as follows:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
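A quick way to verify both values on a 4-core machine is from spark-shell (a hypothetical session; sc is provided by the shell):

// spark-shell --master local[*]
sc.defaultParallelism    // 4 -> number of cores
sc.defaultMinPartitions  // 2 -> math.min(defaultParallelism, 2)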
The actual partition (split) size is defined by the following formula in the method FileInputFormat.computeSplitSize:

package org.apache.hadoop.mapred;

public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
  protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
}
where,
minSize is the hadoop parameter mapreduce.input.fileinputformat.split.minsize (default mapreduce.input.fileinputformat.split.minsize = 1 byte)
blockSize is the value of dfs.block.size in cluster mode (the default value in Hadoop 2.0 is 128 MB) and fs.local.block.size in local mode (default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes)
goalSize = totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions) - if not provided it will be defaultMinPartitions, i.e. 2 if master is set as local[*]
In local mode, blockSize = 33554432 bytes; 33554432 / 1024 = 32768 KB; 32768 / 1024 = 32 MB
Ex1:- If our file size is 91 bytes
minSize=1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize/numPartitions
goalSize = 91 (file size) / 12 (partitions provided as the 2nd parameter in sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); => Math.max(1,Math.min(7,33554432)) = 7 // 33554432 is block size in local mode
Splits = 91(file size 91 bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
=> When computing splitSize, if the file size is > 32 MB then the split size will take the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
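For completeness, the arithmetic of Ex1 can be reproduced outside Spark with a few lines of plain Scala that mirror the Hadoop math shown above (SPLIT_SLOP = 1.1, as in FileInputFormat.getSplits; all input values are the made-up numbers from the example):

val fileSize      = 91L        // bytes
val numPartitions = 12         // 2nd argument to sc.textFile
val minSize       = 1L         // mapreduce.input.fileinputformat.split.minsize
val blockSize     = 33554432L  // fs.local.block.size = 32 MB in local mode

val goalSize  = fileSize / numPartitions                          // 91 / 12 = 7
val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // max(1, min(7, 32 MB)) = 7

// Count the splits the way getSplits() does, with SPLIT_SLOP = 1.1:
val SPLIT_SLOP = 1.1
var bytesRemaining = fileSize
var numSplits = 0
while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
  numSplits += 1
  bytesRemaining -= splitSize
}
if (bytesRemaining != 0) numSplits += 1

println(s"goalSize=$goalSize splitSize=$splitSize splits=$numSplits")  // splits = 13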