I am trying to convert an RDD to a DataFrame without using a case class. The csv file looks like this:
3,193080,De Gea
0,158023,L. Messi
4,192985,K. De Bruyne
1,20801,Cristiano Ronaldo
2,190871,Neymar Jr
val players = sc.textFile("/Projects/Downloads/players.csv")
  .map(line => line.split(','))
  .map(r => Row(r(1), r(2), r(3)))
// players: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[230] at map at <console>:34

val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("age", IntegerType),
  StructField("name", StringType)))
// schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(age,IntegerType,true), StructField(name,StringType,true))

val playersDF = spark.createDataFrame(players, schema)
// playersDF: org.apache.spark.sql.DataFrame = [id: int, age: int ... 1 more field]
Everything works until I try playersDF.show:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
What should I do?
Answer 0 (score: 1)
You have two problems:

1) Your indexing is off: Scala arrays are 0-based, so Row(r(1), r(2), r(3)) should be Row(r(0), r(1), r(2)).

2) line.split returns an Array[String], but your schema says the first and second fields should be integers. You need to convert them to Int before creating the DataFrame.

Basically, this is how players should be created:
val players = rdd.map(line => line.split(","))
  .map(r => Row(r(0).toInt, r(1).toInt, r(2)))
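Putting it together with the question's schema, here is a minimal end-to-end sketch, assuming the spark-shell session from the question (sc and spark predefined) and the same file path:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("age", IntegerType),
  StructField("name", StringType)))

// 0-based indices, with the two numeric columns converted to Int
val players = sc.textFile("/Projects/Downloads/players.csv")
  .map(_.split(','))
  .map(r => Row(r(0).toInt, r(1).toInt, r(2)))

spark.createDataFrame(players, schema).show()

Note that toInt throws a NumberFormatException on malformed lines; the csv reader shown in the next answer is more forgiving, since its default PERMISSIVE mode turns unparseable fields into nulls.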
Answer 1 (score: 1)
I think the best option is to provide a schema and use the existing facilities for reading csv files.
It looks like this:
import org.apache.spark.sql.types._

val playerSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("age", IntegerType, true),
  StructField("name", StringType, true)
))

val players = spark
  .sqlContext
  .read
  .format("csv")
  .option("delimiter", ",")
  .schema(playerSchema)
  .load("/mypath/players.csv")
Answer 2 (score: 0)
//Input
StudentId,Name,Address
101,Shoaib,Anwar Layout
102,Shahbaz,Sara padlya
103,Fahad,Munredy padlya
104,Sana,Tannery Road
105,Zeeshan,Muslim colony
106,Azeem,Khusal nagar
107,Nazeem,KR puram
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}

object SparkCreateDFWithRDD {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Creating DF WITH RDD").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlcontext = new SQLContext(sc)
    val rdd = sc.textFile("/home/cloudera/Desktop/inputs/studentDetails1.csv")
    // Drop the header row before parsing the records
    val header = rdd.first()
    val rddData = rdd.filter(x => x != header).map(x => {
      val arr = x.split(",")
      Row(arr(0).toInt, arr(1), arr(2))
    })
    val schemas = StructType(Array(
      StructField("StudentId", IntegerType, false),
      StructField("StudentName", StringType, false),
      StructField("StudentAddress", StringType, true)))
    val df = sqlcontext.createDataFrame(rddData, schemas)
    df.printSchema()
    df.show()
  }
}
+---------+-----------+--------------+
|StudentId|StudentName|StudentAddress|
+---------+-----------+--------------+
| 101| Shoaib| Anwar Layout|
| 102| Shahbaz| Sara padlya|
| 103| Fahad|Munredy padlya|
| 104| Sana| Tannery Road|
| 105| Zeeshan| Muslim colony|
| 106| Azeem| Khusal nagar|
| 107| Nazeem| KR puram|
+---------+-----------+--------------+
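A side note on this answer: SQLContext has been superseded by SparkSession since Spark 2.0. A sketch of the equivalent setup, reusing the rddData and schemas definitions from above, would be:

import org.apache.spark.sql.SparkSession

// SparkSession bundles SparkContext and SQLContext in Spark 2.x+
val spark = SparkSession.builder()
  .appName("Creating DF WITH RDD")
  .master("local")
  .getOrCreate()

val df = spark.createDataFrame(rddData, schemas)
df.show()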