Question

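In the two examples below, the number of tasks run and the corresponding run times suggest that the sampling options have no effect, since they are similar to those of a job run without any sampling options.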


I know that an explicit schema is best for performance, but sampling is useful when convenience matters.

I'm new to Spark. Am I using these options correctly? I tried the same approach in PySpark, with the same results.

val df = spark.read.option("samplingRatio",0.001).json("s3a://test/*.json.bz2")

val df = spark.read.option("sampleSize",100).json("s3a://test/*.json.bz2")

Answer 1

TL;DR None of the options you use will have a significant impact on the execution time:

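sampleSize is not among the valid JSONOptions or JSONOptionsInRead, so it will be ignored.
samplingRatio is a valid option, but internally it uses PartitionwiseSampledRDD, so the process is linear in terms of the number of records. Sampling can therefore only reduce the inference cost, not the IO, which is likely the bottleneck.
Setting samplingRatio to None is equivalent to no sampling. PySpark's OptionUtils simply discards None options, and samplingRatio defaults to 1.0.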
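
The point about sampling being linear in the number of records can be illustrated with a plain-Python sketch. This is not Spark's code: `bernoulli_sample` is a hypothetical helper that only mimics the idea of keeping each record with a fixed probability, and the generated records stand in for lines read from storage.

```python
import random


def bernoulli_sample(records, ratio, seed=0):
    """Keep each record with probability `ratio`.

    Returns the kept records and the number of records that had to be read.
    Hypothetical helper for illustration only, not part of Spark.
    """
    rng = random.Random(seed)
    records_read = 0
    kept = []
    for record in records:
        # Every record is pulled from the source before the keep/drop decision.
        records_read += 1
        if rng.random() < ratio:
            kept.append(record)
    return kept, records_read


# Stand-in for JSON lines that would otherwise be read from storage.
records = (f'{{"id": {i}}}' for i in range(100_000))
sample, records_read = bernoulli_sample(records, ratio=0.001)
print(len(sample), records_read)
```

Only a small fraction of the records is kept, yet all of them are read, which is why a low ratio can cut the cost of schema inference without cutting the cost of reading the input.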

You can try sampling the data explicitly. In Python:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

def infer_json_schema(path: str, sample_size: int, **kwargs: str) -> StructType:
    spark = SparkSession.builder.getOrCreate()
    sample = spark.read.text(path).limit(sample_size).rdd.flatMap(lambda x: x)
    return spark.read.options(**kwargs).json(sample).schema

In Scala:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

def inferJsonSchema(
    path: String, sampleSize: Int, options: Map[String, String]): StructType = {
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._  // provides the Encoder[String] required by .as[String]
  val sample = spark.read.text(path).limit(sampleSize).as[String]
  spark.read.options(options).json(sample).schema
}

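Keep in mind that, for this to work well, the sample size should be at most equal to the expected size of a partition. Limits in Spark escalate quickly (see, for example, my answer to "Spark count vs take and length"), and you can easily end up scanning the whole input.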
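
To make the escalation concrete, here is a toy model in plain Python. It is not Spark's implementation: it assumes a take-style strategy that scans one partition first and then, while the limit is unmet, tries a number of additional partitions equal to the partitions scanned so far times a scale-up factor. The factor of 4 and the uniform `rows_per_partition` are assumptions made for illustration, and `partitions_scanned_per_round` is a hypothetical name.

```python
def partitions_scanned_per_round(limit, rows_per_partition, total_partitions, scale_up=4):
    """Return the cumulative number of partitions scanned after each round.

    Toy model only: assumes every partition holds `rows_per_partition` rows
    and that each new round tries `scanned * scale_up` more partitions.
    """
    collected = 0
    scanned = 0
    rounds = []
    while collected < limit and scanned < total_partitions:
        # First round scans a single partition; later rounds escalate.
        to_try = 1 if scanned == 0 else scanned * scale_up
        to_try = min(to_try, total_partitions - scanned)
        collected += to_try * rows_per_partition
        scanned += to_try
        rounds.append(scanned)
    return rounds


# Limit fits in one partition: a single partition is scanned.
print(partitions_scanned_per_round(100, 1_000, 1_000))      # prints [1]
# Limit spans many partitions: the scan grows geometrically.
rounds = partitions_scanned_per_round(100_000, 1_000, 1_000)
print(rounds)                                               # prints [1, 5, 25, 125]
```

Under these assumptions, a limit that fits within one partition touches a single partition, while a larger limit quickly pulls in a growing share of the input.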

Are the Spark sampling options in the JSON reader ignored?
