I have a partitioned Hive table. If I create a Spark DataFrame from that table, how many DataFrame partitions will be created?
Answer 0 (score: 0)
It does not depend on the Hive table's partitions; it depends on which version of Spark you are using:
For Spark < 2.0
***Using an RDD and then creating a DataFrame***
If you create the RDD first, you can explicitly specify the number of partitions:
val rdd = sc.textFile("filepath", 4)
As in the example above, the number of partitions is 4, and a DataFrame built from that RDD keeps those 4 partitions.
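A minimal sketch of this, assuming the pre-2.0 shell environment where sc and sqlContext are available and "filepath" points at an existing text file:

import sqlContext.implicits._

val rdd = sc.textFile("filepath", 4)   // explicitly request 4 partitions
val df = rdd.toDF("line")              // DataFrame built on top of the RDD
println(rdd.partitions.length)         // 4
println(df.rdd.partitions.length)      // 4, inherited from the parent RDD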
***Directly creating a DataFrame***
It depends on the Hadoop configuration (minimum/maximum split size).
You can use the Hadoop configuration options
mapred.min.split.size
mapred.max.split.size
as well as the HDFS block size to control the partition size for filesystem-based formats.
// Desired minimum/maximum split sizes in bytes (fill in real values)
val minSplit: Int = ???
val maxSplit: Int = ???
// Applied when reading filesystem-based formats through Hadoop input formats
sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
For Spark >= 2.0
***Using an RDD and then creating a DataFrame***
Same as described for Spark < 2.0.
***Directly creating a DataFrame***
You can use the spark.sql.files.maxPartitionBytes configuration (the default is 128 MB):
// Maximum number of bytes packed into a single partition when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
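Putting it together for Spark 2.x, a minimal sketch assuming a Hive-enabled session and the same hypothetical table mydb.my_table backed by a file-based format such as Parquet; 64 MB is just an illustrative cap:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()                 // needed to read Hive tables
  .getOrCreate()
// Illustrative cap: at most ~64 MB of input per partition
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
val df = spark.table("mydb.my_table")  // hypothetical table name
println(df.rdd.getNumPartitions)       // roughly total input size / 64 MB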
Also keep in mind:
Datasets created from an RDD inherit the number of partitions from their parent.
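A quick sketch of that inheritance, assuming a Spark 2.x session named spark:

import spark.implicits._

val rdd = spark.sparkContext.parallelize(1 to 100, 8)  // parent RDD with 8 partitions
val ds = rdd.toDS()                                    // Dataset created from the RDD
println(ds.rdd.getNumPartitions)                       // 8, inherited from the parent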