Question

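I want to create a DataFrame from a text file.

A case class is limited to 22 fields, and I have more than 100 fields.

So I am running into trouble creating the case class.

My actual goal is to create the DataFrame.

Is there any other way to create a DataFrame without using a case class?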

Answer 1

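One way is to use the spark-csv package to read the file directly and create a DataFrame. The package infers the schema from the header if your file has one, or you can define a custom schema using StructType.

In the example below, I have created a custom schema.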

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

// Option 1: apply the custom schema
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")

// Option 2: let the package infer the column types instead
val dfInferred = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

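You can look at the various other options on the databricks spark-csv documentation page.

Another option:

You can create a schema using StructType as shown above and then use sqlContext's createDataFrame method to create the DataFrame.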
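For readers on Spark 2.0 or later, a CSV reader is built in, so the external package is not needed. The following is a minimal sketch under that assumption; the object name, the temporary file, and its contents are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object BuiltInCsvSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.0+, run in local mode for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("builtin-csv-sketch").getOrCreate()

    // Write a tiny made-up CSV file so the sketch is self-contained.
    val path = Files.createTempFile("cars", ".csv")
    Files.write(path, "year,make,model\n2012,Tesla,S\n1997,Ford,E350\n".getBytes("UTF-8"))

    val df = spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // sample the data to guess column types
      .csv(path.toString)

    df.printSchema()
    spark.stop()
  }
}
```

A custom schema can be passed with .schema(...) instead of inferSchema, in the same way as with the spark-csv package.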

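val vRdd = sc.textFile(..filelocation..)
// Needs import org.apache.spark.sql.Row; createDataFrame takes an RDD[Row] whose values match the schema's types
val rowRdd = vRdd.map(line => Row.fromSeq(line.split(",").toSeq))
val df = sqlContext.createDataFrame(rowRdd, schema)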
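With more than 100 columns, writing each StructField by hand gets tedious. Here is a minimal sketch of generating the schema from a list of column names instead. It assumes every column is read as a string, uses hypothetical names col1 to col120 and made-up records, and runs on SparkSession in local mode (Spark 2.0+).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object WideSchemaSketch {
  // Hypothetical column names; in practice they might come from a header line or a properties file.
  val columnNames: Seq[String] = (1 to 120).map(i => s"col$i")

  // One nullable string field per column name; StructType has no 22-field limit.
  val schema: StructType =
    StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("wide-schema-sketch").getOrCreate()

    // Two made-up comma-separated records with 120 values each, standing in for sc.textFile(...).
    val lines = spark.sparkContext.parallelize(Seq(
      (1 to 120).map(i => s"a$i").mkString(","),
      (1 to 120).map(i => s"b$i").mkString(",")))

    // split(",", -1) keeps trailing empty fields, so each Row has one value per column.
    val rowRdd = lines.map(line => Row.fromSeq(line.split(",", -1).toSeq))

    val df = spark.createDataFrame(rowRdd, schema)
    println(df.columns.length)
    spark.stop()
  }
}
```

Because the schema is built from data rather than from a case class, the number of columns is limited only by what the input file contains.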

Answer 2

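From the Spark documentation:

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

The other approach is to define the StructType with an explicit data type inside each StructField, which lets you use multiple data types. See the example below for both implementations; the first one is kept in the commented-out code.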

package com.spark.examples

import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark._
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Import Row.
import org.apache.spark.sql.Row;
// Import Spark SQL data types
import org.apache.spark.sql.types.{ StructType, StructField, StringType }

object MultipleDataTypeSchema extends Serializable {

  val conf = new SparkConf().setAppName("schema definition")

  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")

  val sc = new SparkContext(conf);
  // sc is an existing SparkContext.
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  def main(args: Array[String]): Unit = {

    // Create an RDD
    val people = sc.textFile("C:/Users/User1/Documents/test")

    /* First implementation: the schema is encoded in a string; split it and map each name to a field.
     * Every column will be of string type.

    // Generate the schema based on the schema string
    val schemaString = "name address age" // You could also read the column names from a properties file.
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));*/

    // Second implementation: define the data type of each field explicitly

    val schema =
      StructType(
        StructField("name", StringType, true) ::
          StructField("address", StringType, true) ::
          StructField("age", StringType, false) :: Nil)

    // Convert records of the RDD (people) to Rows.
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim, p(2).trim))
    // Apply the schema to the RDD.
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()

    sc.stop

  }
}

Output:

17/01/03 14:24:13 INFO SparkContext: Created broadcast 0 from textFile at MultipleDataTypeSchema.scala:30
root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = false)
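The schema above still declares every column as a string. To get genuinely mixed data types, the values placed in each Row have to be converted to match the declared types. Here is a sketch of that idea, assuming the same name,address,age layout, made-up sample records, and SparkSession in local mode (Spark 2.0+); the object and helper names are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TypedSchemaSketch {
  // age is declared as an integer here, unlike the all-string schema above.
  val schema: StructType = StructType(
    StructField("name", StringType, nullable = true) ::
      StructField("address", StringType, nullable = true) ::
      StructField("age", IntegerType, nullable = false) :: Nil)

  // Convert one comma-separated line into a Row whose value types match the schema.
  def toRow(line: String): Row = {
    val p = line.split(",")
    Row(p(0).trim, p(1).trim, p(2).trim.toInt)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("typed-schema-sketch").getOrCreate()

    // Made-up records standing in for the text file.
    val people = spark.sparkContext.parallelize(Seq("Ann, 1 Main St, 34", "Bob, 2 Side Rd, 51"))

    val df = spark.createDataFrame(people.map(toRow), schema)
    df.printSchema()
    spark.stop()
  }
}
```

If a value cannot be parsed (for example a non-numeric age), toInt throws, so real input would likely need validation before this step.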

Answer 3

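Reading the file with sqlContext's sqlContext.read.csv() method works well, because it has many built-in options that you can pass to control the read. But older Spark versions, such as those before 1.6, may not have it (the built-in csv reader was added in Spark 2.0). So you can also do it with the SparkContext's textFile method.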

val a = sc.textFile("file:///file-path/fileName")

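This gives you an RDD[String]. So now you have created the RDD and want to convert it into a DataFrame.

Next, define a schema for your RDD using StructType. This lets you have as many StructFields as you need.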

val schema = StructType(Array(StructField("fieldName1", fieldType, ifNullable),
                              StructField("fieldName2", fieldType, ifNullable),
                              StructField("fieldName3", fieldType, ifNullable),
                              ................
                              ))

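You now have two things: 1) the RDD, which we created with the textFile method, and 2) the schema, with the required number of attributes.

The next step is to map this schema onto your RDD. You may notice that the RDD you have holds plain strings, i.e. it is an RDD[String]. What you actually want is to turn each line into the separate values you created the schema for. So why not split each line on commas? The following map expression does that.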

val b = a.map(x => x.split(","))

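On evaluation you get an RDD[Array[String]].

But you might say this Array[String] is still not very convenient to apply operations to. This is where the Row API comes to the rescue. Import it with import org.apache.spark.sql.Row; we then map each split array to a Row object, much like a tuple. See: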

import org.apache.spark.sql.Row
val c = b.map(x => Row(x(0), x(1),....x(n)))

The expression above gives you an RDD in which every element is a Row. All you need now is to give it a schema. Again, sqlContext's createDataFrame method does the job for you.

val myDataFrame = sqlContext.createDataFrame(c, schema)

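This method takes two arguments: 1) the RDD you need to work on, and 2) the schema you want to apply to it. The result is a DataFrame object. So we have finally created our DataFrame object, myDataFrame. If you call the show method on myDataFrame, you can see the data in tabular format. You can now run any Spark SQL operation on it.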
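To illustrate that last point, here is a short sketch that runs end to end on made-up data: it builds the DataFrame in the way described, displays it with show, and queries it through SQL. It assumes Spark 2.0+ (SparkSession and createOrReplaceTempView; on 1.x the counterparts would be SQLContext and registerTempTable). The column names, sample lines, and view name are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ShowAndQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("show-and-query-sketch").getOrCreate()

    // Stand-in for sc.textFile(...): an RDD[String] of comma-separated lines.
    val a = spark.sparkContext.parallelize(Seq("alice,london", "bob,paris"))

    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("city", StringType, true)))

    // Row.fromSeq avoids spelling out x(0), x(1), ... when there are many fields.
    val c = a.map(_.split(",")).map(x => Row.fromSeq(x.toSeq))
    val myDataFrame = spark.createDataFrame(c, schema)

    // Tabular view of the data.
    myDataFrame.show()

    // Register the DataFrame under a name so it can be queried with SQL.
    myDataFrame.createOrReplaceTempView("people")
    val parisians = spark.sql("SELECT name FROM people WHERE city = 'paris'")
    parisians.show()

    spark.stop()
  }
}
```

Row.fromSeq is used here in place of listing each index, which matters most in the case the question describes, where a record has more than 100 fields.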

How to create a DataFrame without using a case class?

3 answers: