Sample string:
"Canada,0,,0,,,1,,,,,\"From: \"\"nitesh\"\" <nitesh@abc.com>\",Sub: RE: X Support Notification - Service Request #<4-20659465477> has been created.,\"To: \"\"'ABC Update'\"\" <support_reply@xyz.com>,; .<sunny@bcd.com>,; .<anchit@xyz.com>; \",,0,0,0,0,0,"
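I have to parse strings like this in Scala using Spark.
I am splitting the string on the comma (","), but that breaks as soon as a comma appears inside one of the quoted string fields, as in the sample string above.
I am currently using Scala 2.11.12 and Spark 2.4.3.
I am new to Scala and to programming in general, so could someone provide the code for this?
Thanks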
Answer 0: (score: 0)
If you want to convert an RDD of strings to a DataFrame, try the following; it should help.
// Assumes `spark` is an existing SparkSession (as in spark-shell).
val rddStrs = spark.sparkContext.parallelize(List("Canada,0,,0,,,1,,,,,\"From: \"\"nitesh\"\" <nitesh@abc.com>\",Sub: RE: X Support Notification - Service Request #<4-20659465477> has been created.,\"To: \"\"'ABC Update'\"\" <support_reply@xyz.com>,; .<sunny@bcd.com>,; .<anchit@xyz.com>; \",,0,0,0,0,0,"))
val colName = List("start", "from", "subject", "to", "last")
val df = spark.createDataFrame(rddStrs.map { line =>
  // Drop every double quote, then collapse runs of commas into a single comma.
  // Note: this discards empty fields, so the original column positions are lost.
  val cleaned = line.replace("\"", "").replaceAll(",{2,}", ",")
  // Locate the labelled fields and the last ';', which ends the "To:" list.
  // If a marker is missing, indexOf returns -1 and substring will throw.
  val indexOfFrom = cleaned.indexOf("From:")
  val indexOfSub = cleaned.indexOf("Sub:")
  val indexOfTo = cleaned.indexOf("To:")
  val lastIndex = cleaned.lastIndexOf(";")
  // Slice the line into five pieces at those positions.
  val start = cleaned.substring(0, indexOfFrom)
  val from = cleaned.substring(indexOfFrom, indexOfSub)
  val subject = cleaned.substring(indexOfSub, indexOfTo)
  val to = cleaned.substring(indexOfTo, lastIndex)
  val last = cleaned.substring(lastIndex + 1).trim()
  (start, from, subject, to, last)
}).toDF(colName: _*)
df.show()
// Sample output
+-------------+--------------------+--------------------+--------------------+-----------+
|        start|                from|             subject|                  to|       last|
+-------------+--------------------+--------------------+--------------------+-----------+
|Canada,0,0,1,|From: nitesh <nit...|Sub: RE: X Suppor...|To: 'ABC Update' ...|,0,0,0,0,0,|
+-------------+--------------------+--------------------+--------------------+-----------+
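
The approach above depends on the "From:", "Sub:" and "To:" markers and drops empty fields. If the original column positions matter, a quote-aware split may be a better fit. The following is only a minimal sketch in plain Scala, assuming the input follows the usual CSV convention seen in the sample (fields optionally wrapped in double quotes, with "" standing for a literal quote inside a quoted field). The object name CsvLine and its split method are hypothetical names introduced here, not part of Spark or of the answer above.

```scala
// Hypothetical helper, not a Spark API: splits one CSV line on commas that
// lie outside double quotes, and unescapes "" to " inside quoted fields.
object CsvLine {
  def split(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"') // doubled quote inside a quoted field
          i += 1              // skip the second quote of the pair
        } else if (c == '"') {
          inQuotes = false    // closing quote
        } else {
          current.append(c)   // commas are kept as data while quoted
        }
      } else if (c == '"') {
        inQuotes = true       // opening quote
      } else if (c == ',') {
        fields += current.toString // field boundary; empty fields are kept
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field (empty if the line ends with a comma)
    fields.result()
  }
}
```

With the RDD from the answer, this could be applied as rddStrs.map(CsvLine.split), which should give one Vector per line with empty fields preserved; the fields of interest can then be picked by index rather than by searching for markers. Spark's built-in CSV reader with its quote and escape options may also handle this format, but that is not shown here.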