Nested data from an RDD in Scala Spark

Time: 2016-12-07 20:57:51

Tags: scala apache-spark nested-loops

My sample data looks like this:

{ Line 1
Line 2
Line 3
Line 4
...
...
...
Line 6



Complete info:
Dept : HR
Emp name is Andrew lives in Colorodo
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Alex lives in Texas
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Mathew lives in California
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016

Dept : QC
Emp name is Nguyen lives in Nevada
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Cassey lives in Newyork
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Ronney lives in Alasca
DOB : 03/09/1958
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016


line21
line22
line23
...
}

The output I need:

{

Dept    Empname     State       Dob     Projectname         DOJ     DOE
HR  Andrew      Colorodo    03/09/1958  Healthcare      06/04/2011  09/21/2011
HR  Andrew      Colorodo    03/09/1958  Retail          11/04/2011  08/21/2013
HR  Andrew      Colorodo    03/09/1958  Audit           09/11/2013  09/01/2014
HR  Andrew      Colorodo    03/09/1958  ControlManagement   06/04/2011  09/21/2011
HR  Alex        Texas       03/09/1958  Healthcare      06/04/2011  09/21/2011
HR  Alex        Texas       03/09/1958  ControlManagement   06/04/2011  09/21/2011
HR  Mathews     California  03/09/1958  Healthcare      06/04/2011  09/21/2011
HR  Mathews     California  03/09/1958  Retail          11/04/2011  08/21/2013
HR  Mathews     California  03/09/1958  Audit           09/11/2013  09/01/2014
HR  Mathews     California  03/09/1958  ControlManagement   06/04/2011  09/21/2011
QC  Nguyen      Nevada      03/09/1958  Healthcare      06/04/2011  09/21/2011
QC  Nguyen      Nevada      03/09/1958  Retail          11/04/2011  08/21/2013
QC  Nguyen      Nevada      03/09/1958  Audit           09/11/2013  09/01/2014
QC  Nguyen      Nevada      03/09/1958  ControlManagement   06/04/2011  09/21/2011
QC  Casey       Newyork     03/09/1958  Healthcare      06/04/2011  09/21/2011
QC  Casey       Newyork     03/09/1958  Retail          11/04/2011  08/21/2013
QC  Casey       Newyork     03/09/1958  Audit           09/11/2013  09/01/2014
QC  Casey       Newyork     03/09/1958  ControlManagement   06/04/2011  09/21/2011}

I have tried the following options: 1) I wanted to use a map inside another map and then match. I got a lot of errors. Then I read a post here which explained that I cannot have another map inside my map — in fact, no RDD transformation can be performed inside another one. Sorry, I'm new to Spark.
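For illustration, this is roughly the shape of what I attempted; it is only a made-up sketch (assuming an existing SparkContext sc and an invented file name), not my exact code:

// Sketch of attempt 1: using an RDD inside another RDD's transformation.
// Nesting transformations like this is not allowed and fails at runtime.
val lines = sc.textFile("employees.txt")
val broken = lines.map { deptLine =>
  lines.filter(_.startsWith("Emp")).collect() // nested RDD use inside map: not permitted
}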

2) I tried using regular expressions and then calling map on the captured groups. But since each department has multiple employees, and each employee has multiple project entries, I could not group the repeating portions of the data or map them back to the corresponding employee. The same problem applies to matching employees with their department details.

Q1: Is it even possible to transform the sample data above into the output format shown above in Spark/Scala?

Q2: If so, what is the logic/concept that I should be going after?

Thanks in advance.

1 answer:

Answer 0: (score: 1)

Q1: Is it possible to convert such a nested data format using Spark?

A: Yes. If the records were more granular, I would suggest using a multi-line approach like the one discussed in this question: How to process multi line input records in Spark

But given that each "Dept" block in this data can hold a large amount of data, I wouldn't recommend it here.
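For reference, a minimal sketch of that multi-line idea, assuming an existing SparkContext sc and a made-up input path: the record delimiter is changed so that everything between two "Dept :" markers arrives as one record.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Treat everything between two "Dept :" markers as a single multi-line record
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "Dept :")

val deptBlocks = sc
  .newAPIHadoopFile("employees.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty) // the chunk before the first "Dept :" is typically empty

Each resulting block would still need the line-by-line parsing described below, which is why the iterator approach is the more useful part of this answer.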

Q2: What's the logic/concept that I should be going after?

A2: This kind of linear processing, where state is built up as we traverse the lines, is better approached using an iterator- or stream-based implementation:

We consume the input line by line and produce records only once they are complete. The context is preserved in some state. With this approach it really doesn't matter how big the file is, since the memory requirement is limited to the size of one record plus the overhead of the state handling.

Here's a working example of how to deal with it using an iterator in plain Scala:

// One flattened output row per (employee, project) combination
case class EmployeeRecord(dept: String, name: String, location: String, dob: String, project: String, joined: String, left: String) {
  def toCSV = this.productIterator.mkString(", ")
}


class EmployeeParser() {

  // Mutable state: the field values collected so far for the record currently being built
  var currentStack : Map[String, String] = Map()

  val (dept, name, location, birthdate, project, joined, left) = ("dept", "name", "location", "birthdate", "project", "joined", "left")
  // The order in which the fields appear in the input
  val keySequence = Seq(dept, name, location, birthdate, project, joined, left)
  // Maps the labels used in the file to the internal field names
  val ParseKeys = Map("Project name" -> project, "DOJ" -> joined, "DOL" -> left, "DOB" -> birthdate, "Dept" -> dept)
  val keySet = keySequence.toSet

  // When a field is (re)set, drop it and every field that follows it in the sequence,
  // since those values belong to the previous (now stale) record
  def clearDependencies(key: String) : Unit = {
    val obsoleteKeys = keySequence.dropWhile(k => k != key).toSet
    currentStack = currentStack.filterKeys(k => !obsoleteKeys.contains(k))
  }

  // A field may only be set if all the fields that precede it in the sequence
  // (except the very first) are already present, i.e. the line appears in a plausible position
  def isValidEntry(key: String) : Boolean = {
    val precedents = keySequence.takeWhile(k => k != key).drop(1)
    precedents.forall(k => currentStack.contains(k))
  }

  // Record a field value; returns None if the line appears out of sequence
  def add(key:String, value:String): Option[Unit] = {
    if (!isValidEntry(key)) None else {
      clearDependencies(key)
      currentStack = currentStack + (key -> value)
      Some(())
    }
  } 

  // A complete record can only be assembled once every field is present
  def record: Option[EmployeeRecord] = 
    for {
      _dept <- currentStack.get(dept)
      _name <- currentStack.get(name)
      _location <- currentStack.get(location)
      _dob <- currentStack.get(birthdate)
      _project <- currentStack.get(project)
      _joined <- currentStack.get(joined)
      _left <- currentStack.get(left)
    } yield EmployeeRecord(_dept, _name, _location, _dob, _project,_joined, _left)

  val EmpRegex = "^Emp name is (.*) lives in (.*)$".r

  def parse(line:String):Option[EmployeeRecord] = {
    if (line.startsWith("Emp")) { // the employee line doesn't follow the "key : value" layout, so handle it separately
      line match {
        case EmpRegex(n, l) =>
          add(name, n)
          add(location, l)
        case _ => // an "Emp..." line that doesn't match the expected pattern is ignored
      }
      None
    } else {
      val entry = line.split(":").map(_.trim)
      for { entryKey <- entry.lift(0)
            entryValue <- entry.lift(1)
            key <- ParseKeys.get(entryKey)
            _ <- add(key, entryValue)
            rec <- record // only yields once every field of a record is present
          } yield rec
    }
  }
}

To use it, we instantiate the parser and apply it to an iterator:

import scala.io.Source

val iterator = Source.fromFile(...).getLines
val parser = new EmployeeParser()
val parsedRecords = iterator.map(parser.parse).collect { case Some(record) => record }
val parsedCSV = parsedRecords.map(rec => rec.toCSV)
parsedCSV.foreach(line => println(line)) // replace println with a write to the destination file
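If the flattened records should ultimately land in Spark, as the question asks, one possible follow-up (instead of the manual CSV write above) is to hand them to a DataFrame. This is only a sketch: it assumes a SparkSession named spark and keeps the parsing itself on the driver, because the parser's state depends on the lines being seen in order.

// Sketch only: assumes an existing SparkSession named `spark`
import spark.implicits._

val employeeDF = parsedRecords.toSeq.toDF() // materialises the iterator; fine for files of moderate size
employeeDF.show(truncate = false)
// employeeDF.write.option("header", "true").csv("employees_out") // "employees_out" is a placeholder path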