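I'm very new to Spark and Scala (about two hours new). I'm trying to work with a CSV data file, but I'm stuck because I don't know how to handle the header row. I've searched the internet for ways to load it or skip it, but I really can't figure out how to do it. I'm pasting the code I'm using below; please help.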
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TaxiCaseOne{
case class NycTaxiData(Vendor_Id:String, PickUpdate:String, Droptime:String, PassengerCount:Int, Distance:Double, PickupLong:String, PickupLat:String, RateCode:Int, Flag:String, DropLong:String, DropLat:String, PaymentMode:String, Fare:Double, SurCharge:Double, Tax:Double, TripAmount:Double, Tolls:Double, TotalAmount:Double)
def mapper(line:String): NycTaxiData = {
val fields = line.split(',')
val data:NycTaxiData = NycTaxiData(fields(0), fields(1), fields(2), fields(3).toInt, fields(4).toDouble, fields(5), fields(6), fields(7).toInt, fields(8), fields(9),fields(10),fields(11),fields(12).toDouble,fields(13).toDouble,fields(14).toDouble,fields(15).toDouble,fields(16).toDouble,fields(17).toDouble)
return data
}

def main(args: Array[String]): Unit = {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Use new SparkSession interface in Spark 2.0
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val lines = spark.sparkContext.textFile("../nyc.csv")
val data = lines.map(mapper)
// Infer the schema, and register the DataSet as a table.
import spark.implicits._
val schemaData = data.toDS
schemaData.printSchema()
schemaData.createOrReplaceTempView("data")
// SQL can be run over DataFrames that have been registered as a table
val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id == 'CMT'")
val results = vendor.collect()
results.foreach(println)
spark.stop()
}
}
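Answer 0 (score: 0)
If you have a CSV file, you should read it with Spark's CSV reader (spark-csv, which is built into Spark 2.x as the csv data source) rather than with textFile.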
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val df = spark.read
.option("header", "true") // Treat the first line as the header (column names), not as data
.csv("../nyc.csv")
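To connect this back to the query in the question, here is a minimal sketch of how the resulting DataFrame could be inspected and queried. The object name CsvHeaderExample, the helper readWithHeader, and the column name vendor_id are all assumptions made for illustration; the real column names come from whatever the header line of nyc.csv contains. The inferSchema option is optional and makes Spark take an extra pass over the file to guess column types.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvHeaderExample {
  // Hypothetical helper: reads a CSV whose first line is a header row.
  def readWithHeader(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // first line supplies column names, not data
      .option("inferSchema", "true") // extra pass over the file to guess column types
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CsvHeaderExample")
      .master("local[*]")
      .getOrCreate()

    val df = readWithHeader(spark, "../nyc.csv")
    df.printSchema() // check the real column names before writing SQL against them

    df.createOrReplaceTempView("data")
    // vendor_id is an assumed header name; substitute whatever printSchema() reports
    val vendor = spark.sql("SELECT * FROM data WHERE vendor_id = 'CMT'")
    vendor.show(10)

    spark.stop()
  }
}
```

Because the header line is consumed as column names, it never shows up as a data row, so no manual skipping or per-field parsing is needed.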
You need the spark-core and spark-sql dependencies to use this.
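If you build with sbt, those two dependencies might be declared roughly as follows. This is only a sketch: the Spark version 2.0.0 is taken from the comment in the code above, and the Scala version is an assumption, so match both to your own installation.

```scala
// build.sbt (sketch; versions are assumptions, match them to your Spark install)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)
```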
Hope this helps!