Spark Scala case class with array and map data types

Date: 2018-08-14 22:15:55

Tags: scala apache-spark dataframe

I have data like the following:

[Michael, 100, Montreal,Toronto, Male,30, DB:80, Product:DeveloperLead]
[Will, 101, Montreal, Male,35, Perl:85, Product:Lead,Test:Lead]
[Steven, 102, New York, Female,27, Python:80, Test:Lead,COE:Architect]
[Lucy, 103, Vancouver, Female,57, Sales:89,HR:94, Sales:Lead]  

So I have to read this data and define a case class using Spark. I wrote the code below, but I get an error when converting to a DataFrame. Can someone tell me what is wrong with my code and how to fix it?

case class Ayush(name: String,employee_id:String ,work_place: Array[String],sex_age: Map [String,String],skills_score: Map[String,String],depart_title: Map[String,Array[String]])

I get an error on the following line:

val d = df.map(w=> Ayush(w(0),w(1),w(2)._1,w(2)._2,w(3)._1,w(3)._2,w(4)._1,w(4)._2,w(5)._1,w(5)._2._1,w(5)._2._2))).toDF  

2 Answers:

Answer 0 (Score: 2)

I have changed your data: the work place and department fields are wrapped in double quotes so that their comma-separated values stay together in a single field, and I added a custom separator (an underscore) so the data can be split apart later. You can use a separator of your own.

The data looks like this:

Michael,100,"Montreal,Toronto",Male,30,DB:80,"Product,DeveloperLead"
Will,101,Montreal,Male,35,Perl:85,"Product,Lead,Test,Lead"
Steven,102,New York,Female,27,Python:80,"Test,Lead,COE,Architect"
Lucy,103,Vancouver,Female,57,Sales:89_HR:94,"Sales,Lead"
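
As a quick sanity check (this snippet is not part of the original answer, and the file path is a placeholder), Spark's CSV reader keeps the double-quoted pieces together as single columns because " is the default quote character:

val raw = spark.read.csv("/path/to/employee_data.csv")
// With no header, the columns are named _c0 ... _c6; _c2 and _c6 keep their commas intact
raw.select("_c2", "_c6").show(false)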

Below are the code changes I made, which worked fine for me:

val df = spark.read.csv("CSV PATH HERE")

case class Ayush(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])

val resultDF = df.map { x =>
  val departTitleData = x(6).toString
  val skill_score = x(5).toString
  val skill_Map = scala.collection.mutable.Map[String, String]()
  // Split the skills field on the underscore separator; each piece is "skill:score",
  // which is split again on ":" and added to the map
  skill_score.split("_").foreach { s => skill_Map += (s.split(":")(0) -> s.split(":")(1)) }
  // Put the parsed fields into the case class
  Ayush(x(0).toString, x(1).toString, x(2).toString.split(","),
    Map(x(3).toString -> x(4).toString),
    skill_Map.toMap,
    Map(departTitleData.split(",")(0) -> departTitleData.split(",")))
}
// End here
  • This may not be exactly what you expected, but it may help you work out what you are trying to do. (A small follow-up query sketch is shown below.)
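
A follow-up sketch (not part of the original answer): once resultDF exists, the map and array columns can be queried directly. The column and key names below come from the case class and the sample data above:

import org.apache.spark.sql.functions.col

// Employees whose skills_score map contains a "DB" entry
resultDF.filter(col("skills_score").getItem("DB").isNotNull).show(false)
// First listed work place for every employee
resultDF.select(col("name"), col("work_place").getItem(0).as("first_work_place")).show(false)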

Answer 1 (Score: 0)

@vishal I don't know whether this question is still active, but here is my solution without changing the source data. Fair warning, it may be a little messy :)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Not shown in the original answer: a `schema` case class mirroring the fields of the question's Ayush case class
case class schema(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])

def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    val rdd1 = sc.textFile("file:///C:/Users/k.sandeep.varma/Downloads/documents/documents/spark_data/employee_data.txt")
    // Strip the surrounding [ and ] from each record
    val clean_rdd = rdd1.map(x => x.replace("[", "")).map(x => x.replace("]", ""))
    // Split on ", "; values like "Montreal,Toronto" and "Male,30" have no space after the comma, so they stay in one field and are split again here
    val schema_rdd = clean_rdd.map(x => x.split(", ")).map(x => schema(x(0), x(1), x(2).split(","), Map(x(3).split(",")(0) -> x(3).split(",")(1)), Map(x(4).split(":")(0) -> x(4).split(":")(1)), Map(x(5).split(":")(0) -> x(5).split(":"))))
    val df1 = schema_rdd.toDF()
    df1.printSchema()
    df1.show(false)
}
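
A hedged follow-up (not part of the original answer): once df1 is built, the nested columns can be flattened or inspected with standard Spark SQL functions such as explode and map_keys (map_keys needs Spark 2.3 or later):

import org.apache.spark.sql.functions.{col, explode, map_keys}

// One row per (name, work place) pair from the work_place array column
df1.select(col("name"), explode(col("work_place")).as("work_place")).show(false)
// Just the skill names, taken from the keys of the skills_score map
df1.select(col("name"), map_keys(col("skills_score")).as("skills")).show(false)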