I have a text string in the following format:
"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female"
I want to split the string on commas (,) but ignore the commas that appear inside double quotes (""). I am using Spark and Scala with a case class to create a DataFrame. I tried the code below but I get an error:
val tit_rdd = td.map(td=>td.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")).map(td=>tit(td(0).replaceAll("\"","").toInt ,
td(1).replaceAll("\"",""),
td(2).toInt,
td(3).replaceAll("\"",""),
td(4).toDouble,
td(5).replaceAll("\"",""),
td(6).replaceAll("\"",""),
td(7).replaceAll("\"",""),
td(8).replaceAll("\"",""),
td(9).replaceAll("\"","").toInt,
td(10).replaceAll("\"","")))
The case class code is as follows:
case class tit (Num: Int, Class: String, Survival_Code: Int, Name: String, Age: Double, Province: String, Address: String, Coach_No: String, Coach_ID: String, Floor_No:Int, Gender:String)
Error:
17/05/21 14:52:39 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:40)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:31)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: 2)
The NumberFormatException is caused by an empty value in your data that you are trying to convert to an Int with .toInt. The fix is to wrap the conversion in Try with getOrElse, as shown below:
import scala.util.Try

val tit_rdd = td.map(td=>td.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"))
  .map(td=>tit(Try(td(0).replaceAll("\"","").toInt) getOrElse 0,
    td(1).replaceAll("\"",""),
    Try(td(2).toInt) getOrElse 0,
    td(3).replaceAll("\"",""),
    Try(td(4).toDouble) getOrElse 0.0,
    td(5).replaceAll("\"",""),
    td(6).replaceAll("\"",""),
    td(7).replaceAll("\"",""),
    td(8).replaceAll("\"",""),
    Try(td(9).replaceAll("\"","").toInt) getOrElse 0,
    td(10).replaceAll("\"","")))
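For illustration, a quick check in the Scala REPL (a minimal sketch, not from the original answer) of how Try behaves on the empty field that caused the exception:
import scala.util.Try
Try("".toInt) getOrElse 0             // Int = 0, instead of throwing NumberFormatException
Try("29.0000".toDouble) getOrElse 0.0 // Double = 29.0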
This should fix the problem.
Another way to convert the text file to a DataFrame is to use the databricks csv reader:
sqlContext.read.format("com.databricks.spark.csv").load("path to the text file")
This will generate default header names such as _c0, _c1. What you can do is put a header line in your text file and set the header option as in the line below:
sqlContext.read.format("com.databricks.spark.csv").option("header", true).load("path to the text file")
You can play around with more options yourself.
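If you want meaningful column names without adding a header line, a small sketch (assuming the spark-csv package is on the classpath and using a hypothetical file path) is to rename the auto-generated columns with toDF so they match the case class fields:
val raw = sqlContext.read.format("com.databricks.spark.csv").load("path to the text file")
// Rename the auto-generated _c0.._c10 columns to the field names of the tit case class
val named = raw.toDF("Num", "Class", "Survival_Code", "Name", "Age",
  "Province", "Address", "Coach_No", "Coach_ID", "Floor_No", "Gender")
named.printSchema()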
Answer 1 (score: 0)
You should use Spark's built-in csv reader.
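A minimal sketch of what that looks like (assuming Spark 2.x, where the csv source is built in, and a hypothetical file path):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("builtin-csv").getOrCreate()

// The built-in csv reader already handles commas inside double-quoted fields
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("path to the text file")

df.show()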
Answer 2 (score: 0)
You can use Spark-CSV to load the csv data; it handles all the commas inside double quotes. Here is how to use it:
import org.apache.spark.sql.{Encoders, SparkSession}

case class tit(Num: Int,
               Class: String,
               Survival_Code: Int,
               Name: String,
               Age: Double,
               Province: String,
               Address: String,
               Coach_No: String,
               Coach_ID: String,
               Floor_No: Int,
               Gender: String)

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val titschema = Encoders.product[tit].schema
val dfList = spark.read.schema(schema = titschema).csv("data.csv").as[tit]
dfList.show()
I hope this helps!
If you want to create the schema to use with SQLContext.createDataFrame, you can use Scala Reflection as:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val titschema = ScalaReflection.schemaFor[tit].dataType.asInstanceOf[StructType]
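For completeness, a rough sketch of how such a schema could feed SQLContext.createDataFrame (sqlContext and an already-split RDD of string fields, here called rowRdd, are assumed/hypothetical):
import org.apache.spark.sql.Row

// Build Rows whose field order and types match titschema;
// empty numeric fields would still need the Try guard from answer 0
val rows = rowRdd.map(f => Row(f(0).toInt, f(1), f(2).toInt, f(3), f(4).toDouble,
  f(5), f(6), f(7), f(8), f(9).toInt, f(10)))
val df = sqlContext.createDataFrame(rows, titschema)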
Answer 3 (score: 0)
I hope this can help you: first replace all the splittable "," sequences with "#", then split on "#".
scala> st.replace("""","""", "#").replace("""",""","#").replace(""","""", "#").replace(""""""", "").split("#").map("\"" + _ + "\"")
res1: Array[String] = Array("1", "1st", "1", "Allen, Miss Elisabeth Walton", "29.0000", "Southampton", "St Louis, MO", "B-5", "24160 L221", "2", "female")
scala> res1.size
res2: Int = 11
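A rough sketch of how this trick could replace the regex split in the original pipeline (assuming the td RDD and the tit case class from the question, and defaulting empty numeric fields as in answer 0):
import scala.util.Try

val tit_rdd = td
  .map(line => line
    .replace("\",\"", "#")   // separator between two quoted fields
    .replace("\",", "#")     // quoted field followed by an unquoted one
    .replace(",\"", "#")     // unquoted field followed by a quoted one
    .replace("\"", "")       // drop the remaining outer quotes
    .split("#"))
  .map(f => tit(Try(f(0).toInt) getOrElse 0, f(1), Try(f(2).toInt) getOrElse 0, f(3),
    Try(f(4).toDouble) getOrElse 0.0, f(5), f(6), f(7), f(8),
    Try(f(9).toInt) getOrElse 0, f(10)))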