Question

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (e.g. limit(100)).

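However, with inferSchema set to True, the number of records read by the FileScan always seems to equal the number of rows in all the CSV files, even when limit(100) is applied.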

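Is there a way to make the FileScan more selective, so that Spark looks at fewer rows when inferring the schema?

Note: setting the samplingRatio option to < 1.0 does not produce the desired behavior, although it is clear that inferSchema uses only the sampled subset of rows.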

Answer 1

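You can read a subset of your input data into a Dataset of String. The csv method accepts such a Dataset as its argument.

Here is a simple example (I'll leave reading the sample of rows from the input files to you):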

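// In spark-shell, `sc` and `spark` are predefined and spark.implicits._ is already imported (needed for toDS)
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
// csv(Dataset[String]) infers the schema from these lines only
val df = spark.read.option("inferSchema", "true").csv(csvRDD.toDS)
df.schema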

When run in spark-shell, the last line above prints the following (reformatted for readability):

res4: org.apache.spark.sql.types.StructType = 
    StructType(
      StructField(_c0,IntegerType,true),
      StructField(_c1,IntegerType,true),
      StructField(_c2,StringType,true)
    )

This is the correct schema for my limited input dataset.
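
The step left to the reader above, reading a sample of rows from the input files, could be sketched as follows. This is only a minimal sketch under assumptions: the helper names `inferSchemaFromHead` and `readWithSampledSchema` are hypothetical, the files are assumed to have no header row, and the first `n` lines are assumed to be representative of the rest of the data.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: infer a CSV schema from the first `n` lines found under `path`.
def inferSchemaFromHead(spark: SparkSession, path: String, n: Int): StructType = {
  import spark.implicits._
  // Read the input as plain text and bring only the first n lines to the driver
  val sampleLines: Array[String] = spark.read.textFile(path).take(n)
  val sampleDS: Dataset[String] = spark.createDataset(sampleLines.toSeq)
  // Schema inference now only sees the sampled lines
  spark.read.option("inferSchema", "true").csv(sampleDS).schema
}

// Hypothetical helper: read the full input with the sampled schema, so no second inference pass runs.
def readWithSampledSchema(spark: SparkSession, path: String, n: Int): DataFrame =
  spark.read.schema(inferSchemaFromHead(spark, path, n)).csv(path)
```

Passing an explicit schema to the full read means rows that do not match the sampled types are handled by the reader's parse mode rather than widening the inferred type, so the sample size is a trade-off between speed and accuracy.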

Answer 2

Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType
// Note: these are Spark-internal classes, so their signatures may differ between Spark versions
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}

def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}

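Unlike GMc's answer, this approach tries to infer the schema directly, in the same way DataFrameReader.csv() does in the background, but without building an additional Dataset with that schema that we would then use only to read the schema back from it.

The schema is inferred from a Dataset[String] that contains only the first sampleSize lines of the input files as plain text strings.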

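When trying to retrieve a sample from the data, Spark has only two types of methods:

1. Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
2. Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it can read a single partition (if the required number of rows is low enough).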

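Since you mentioned that you want to use a small, specific number of rows, and since you want to avoid touching all the data, I provided a solution based on option 2.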
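
To make the contrast between the two kinds of methods concrete, here is a rough sketch using the public Dataset API. The wrapper names `sampleByFraction` and `sampleByCount` are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.Dataset

// Option 1: fraction-based sampling. The result stays distributed,
// but every partition (and so every input file) is scanned.
def sampleByFraction(lines: Dataset[String], fraction: Double): Dataset[String] =
  lines.sample(withReplacement = false, fraction = fraction)

// Option 2: fixed row count. The rows are collected on the driver,
// and Spark may stop early once it has gathered n rows.
def sampleByCount(lines: Dataset[String], n: Int): Array[String] =
  lines.take(n)
```

Note that the fraction-based variant returns only approximately `fraction` of the rows, whereas `take(n)` returns exactly `n` rows whenever at least that many exist.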

PS: The DataFrameReader.textFile method accepts paths to files or folders, and it also has a varargs variant, so you can pass in one or more files or folders.
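
As a small illustration of the varargs variant, the sketch below takes the first lines from any mix of files and folders. The helper name `headLines` is hypothetical, and so are the paths in the usage line.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: first n lines across one or more files or folders.
def headLines(spark: SparkSession, n: Int, paths: String*): Array[String] =
  spark.read.textFile(paths: _*).take(n)
```

It could be called as `headLines(spark, 100, "/data/a.csv", "/data/more_csvs/")`, where both paths stand in for your own input locations.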

Spark: inferring the schema with a limit during read.csv