I want to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the code that creates the Dataset:
var location = "s3a://path_to_csv"
case class City(name: String, state: String, number_of_people: Long)
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate"
Databricks discusses how to create Datasets, and this particular error message, in this blog post.
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255), the analyzer will emit an AnalysisException.
I am using the Long type, so I did not expect to see this error message.
Answer 0 (score: 17)
Use schema inference:
val cities = spark.read
.option("inferSchema", "true")
...
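For reference, a complete version of the inference approach (a sketch, assuming the same location and City case class from the question) would look like this:

val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // let Spark infer column types from the data
  .csv(location)
  .as[City]

Note that schema inference requires an extra pass over the data to determine the column types.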
Or provide a schema:
val cities = spark.read
.schema(StructType(Array(StructField("name", StringType), ...)))
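Spelled out for this particular CSV, a sketch of the explicit-schema approach (citySchema is an illustrative name; the field names are assumed to match the header, and the City case class comes from the question) could be:

import org.apache.spark.sql.types._

// Explicit schema: number_of_people is declared as LongType up front,
// so the encoder check for City no longer sees a string column.
val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema)
  .csv(location)
  .as[City]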
Or cast:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val cities = spark.read
.option("header", "true")
.csv(location)
.withColumn("number_of_people", col("number_of_people").cast(LongType))
.as[City]
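With the cast approach Spark reads every column as a string (no inference pass over the data) and only number_of_people is converted to a long before the encoder check; the other columns stay as strings, which is fine here because City never references coolness_index.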
Answer 1 (score: 2)
With your case class City(name: String, state: String, number_of_people: Long), you only need one line:
private val cityEncoder = Seq(City("", "", 0)).toDS
and then your code
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
will work.
Here is the official source: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
Answer 2 (score: -1)
Input CSV file User.csv:
id,name,address
1,Arun,Indore
2,Shubham,Indore
3,Mukesh,Hariyana
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
    SparkSession sparkSession = new SparkSession(new SparkContext(sparkConf));
    // Read the CSV with a header row; without inferSchema every column is read as a string.
    Dataset<Row> dataset = sparkSession.read().option("header", "true")
            .csv("C:\\Users\\arun7.gupta\\Desktop\\Spark\\User.csv");
    dataset.show();
    sparkSession.close();
}
**Output:**
+---+-------+--------+
| id| name| address|
+---+-------+--------+
| 1| Arun| Indore|
| 2|Shubham| Indore|
| 3| Mukesh|Hariyana|
+---+-------+--------+