I am trying to load CSV data into a Hive table using SparkSession. I want to skip the header row while loading into the Hive table, but setting tblproperties("skip.header.line.count"="1") does not work either.
I am using the following code:
import java.io.File
import org.apache.spark.sql.{SparkSession,Row,SaveMode}
case class Record(key: Int, value: String)
val warehouseLocation=new File("spark-warehouse").getAbsolutePath
val spark=SparkSession.builder().appName("Apache Spark Book Crossing Analysis").config("spark.sql.warehouse.dir",warehouseLocation).enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
//sql("set hive.vectorized.execution.enabled=false")
sql("drop table if exists BookTemp")
sql ("create table BookTemp(ISBN int,BookTitle String,BookAuthor String ,YearOfPublication int,Publisher String,ImageURLS String,ImageURLM String,ImageURLL String)row format delimited fields terminated by ';' ")
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")
sql("load data local inpath 'BX-Books.csv' into table BookTemp")
sql("select * from BookTemp limit 5").show
Console error:
res55: org.apache.spark.sql.DataFrame = []
<console>:1: error: ')' expected but '.' found.
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")
2019-02-20 22:48:09 WARN LazyStruct:151 - Extra bytes detected at the end of the row! Ignoring similar problems.
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|ISBN| BookTitle| BookAuthor|YearOfPublication| Publisher| ImageURLS| ImageURLM| ImageURLL|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|null| "Book-Title"| "Book-Author"| null| "Publisher"| "Image-URL-S"| "Image-URL-M"| "Image-URL-L"|
|null|"Classical Mythol...|"Mark P. O. Morford"| null|"Oxford Universit...|"http://images.am...|"http://images.am...|"http://images.am...|
|null| "Clara Callan"|"Richard Bruce Wr...| null|"HarperFlamingo C...|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Decision in Norm...| "Carlo D'Este"| null| "HarperPerennial"|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Flu: The Story o...| "Gina Bari Kolata"| null|"Farrar Straus Gi...|"http://images.am...|"http://images.am...|"http://images.am...|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows
As the result shows, the header is still loaded as the first row of data; I want to skip it.
Answer 0 (score: 1)
Another option is to convert the CSV (with its header) to Parquet using Spark SQL:
val df = spark.sql("select * from schema.table")
df.coalesce(1).write.options(Map("header" -> "true", "compression" -> "snappy")).mode(SaveMode.Overwrite).parquet("/path/to/output") // "/path/to/output" is a placeholder for your destination
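Since the question's file is semicolon-delimited, a minimal end-to-end sketch (assuming the BX-Books.csv file name from the question and a placeholder output path) could let the CSV reader consume the header and then write Parquet directly:
import org.apache.spark.sql.SaveMode

// Read the CSV; header=true consumes the first line as column names instead of data
val books = spark.read
  .option("header", "true")
  .option("sep", ";")
  .csv("BX-Books.csv")

// Write a single snappy-compressed Parquet output
books.coalesce(1)
  .write
  .option("compression", "snappy")
  .mode(SaveMode.Overwrite)
  .parquet("/path/to/output")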
Answer 1 (score: 0)
If you are using SQL, a workaround is to add a filter to the query (note that LIMIT has to come after WHERE):
sql("select * from BookTemp limit 5 where BookTitle!='Book-Title'").show
This behavior is tracked in the following JIRA: https://issues.apache.org/jira/browse/SPARK-11374
Also see https://github.com/apache/spark/pull/14638. Alternatively, you can use the CSV reader option:
spark.read.option("header","true").csv("/data").show
Or remove the header with the shell before loading:
file="myfile.csv"
tail -n +2 "$file" > "$file.tmp" && mv "$file.tmp" "$file"
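If you prefer to keep the LOAD DATA approach, the same header strip can be driven from the Spark shell with scala.sys.process, a sketch assuming the file name from the question:

import scala.sys.process._

// Drop the first line in place, then load the cleaned file into Hive
Seq("bash", "-c", "tail -n +2 BX-Books.csv > BX-Books.csv.tmp && mv BX-Books.csv.tmp BX-Books.csv").!

sql("load data local inpath 'BX-Books.csv' into table BookTemp")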