I want to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the code that creates the Dataset:
var location = "s3a://path_to_csv"
case class City(name: String, state: String, number_of_people: Long)
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate"
Databricks discusses how to create Datasets, and this particular error message, in this blog post.
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255), the analyzer will emit an AnalysisException.
I am using the Long type, so I did not expect to see this error message.
Answer 0 (score: 17)
Use schema inference:
val cities = spark.read
.option("inferSchema", "true")
...
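For reference, a complete version of the inference approach (a sketch, assuming the same location and City case class from the question) would look like this:

val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // let Spark infer column types from the data
  .csv(location)
  .as[City]

Note that schema inference requires an extra pass over the data to determine the column types.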
Or provide a schema:
val cities = spark.read
.schema(StructType(Array(StructField("name", StringType), ...)))
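Spelled out for this particular CSV, a sketch of the explicit-schema approach (citySchema is an illustrative name; the field names are assumed to match the header, and the City case class comes from the question) could be:

import org.apache.spark.sql.types._

// Explicit schema: number_of_people is declared as LongType up front,
// so the encoder check for City no longer sees a string column.
val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema)
  .csv(location)
  .as[City]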
Or cast:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val cities = spark.read
.option("header", "true")
.csv(location)
.withColumn("number_of_people", col("number_of_people").cast(LongType))
.as[City]
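With the cast approach Spark reads every column as a string (no inference pass over the data) and only number_of_people is converted to a long before the encoder check; the other columns stay as strings, which is fine here because City never references coolness_index.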
Answer 1 (score: 2)
With your case class City(name: String, state: String, number_of_people: Long), you only need one line:
private val cityEncoder = Seq(City("", "", 0)).toDS
and then your code
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
will work.
Here is the official source: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
Answer 2 (score: -1)
Input CSV file User.csv:
id,name,address
1,Arun,Indore
2,Shubham,Indore
3,Mukesh,Hariyana
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
    SparkSession sparkSession = new SparkSession(new SparkContext(sparkConf));
    // Read the CSV with a header row; without inferSchema every column is read as a string.
    Dataset<Row> dataset = sparkSession.read().option("header", "true")
            .csv("C:\\Users\\arun7.gupta\\Desktop\\Spark\\User.csv");
    dataset.show();
    sparkSession.close();
}
**Output:**
+---+-------+--------+
| id| name| address|
+---+-------+--------+
| 1| Arun| Indore|
| 2|Shubham| Indore|
| 3| Mukesh|Hariyana|
+---+-------+--------+