My sample data looks like this:
{ Line 1
Line 2
Line 3
Line 4
...
...
...
Line 6
Complete info:
Dept : HR
Emp name is Andrew lives in Colorodo
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Alex lives in Texas
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Mathew lives in California
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Dept : QC
Emp name is Nguyen lives in Nevada
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Cassey lives in Newyork
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Ronney lives in Alasca
DOB : 03/09/1958
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
line21
line22
line23
...
}
The output I need:
{
Dept Empname State Dob Projectname DOJ DOL
HR Andrew Colorodo 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Andrew Colorodo 03/09/1958 Retail 11/04/2011 08/21/2013
HR Andrew Colorodo 03/09/1958 Audit 09/11/2013 09/01/2014
HR Andrew Colorodo 03/09/1958 ControlManagement 01/08/2015 02/14/2016
HR Alex Texas 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Alex Texas 03/09/1958 ControlManagement 01/08/2015 02/14/2016
HR Mathew California 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Mathew California 03/09/1958 Retail 11/04/2011 08/21/2013
HR Mathew California 03/09/1958 Audit 09/11/2013 09/01/2014
HR Mathew California 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Nguyen Nevada 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Nguyen Nevada 03/09/1958 Retail 11/04/2011 08/21/2013
QC Nguyen Nevada 03/09/1958 Audit 09/11/2013 09/01/2014
QC Nguyen Nevada 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Cassey Newyork 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Cassey Newyork 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Ronney Alasca 03/09/1958 Audit 09/11/2013 09/01/2014
QC Ronney Alasca 03/09/1958 ControlManagement 01/08/2015 02/14/2016}
Here's what I've tried:
1) Using a map inside another map and matching inside it. I got many errors, and then read an article explaining that I can't use a map inside another map; in fact, no RDD transformation can be performed inside another one. Sorry, I'm new to Spark.
2) Using regular expressions and then calling map on the captured groups. But since each department has multiple employees, and each employee has multiple project entries, I couldn't group the repeated sections or map them back to the corresponding employee; the same problem applies to matching employees to departments.
Q1: Is it even possible to transform the sample data above into the desired format in Spark/Scala?
Q2: If so, what logic/concepts should I pursue?
Thanks in advance.
Answer 0 (score: 1)
A1: Yes. If the records were more granular, I would suggest a multi-line approach like the one discussed in this question: How to process multi line input records in Spark
But given that in this data a "Dept" section holds a large amount of data, I wouldn't recommend it.
A2: This kind of linear processing, where state is built up as we traverse the lines, is better approached with an iterator- or stream-based implementation:
We consume the input line by line and produce records only when they are complete; the context is preserved in some state. With this approach it really doesn't matter how big the file is, as the memory requirement is limited to the size of one record plus the overhead of the state handling.
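To make the idea concrete before the full parser, here is a minimal sketch of the pattern on a toy two-field format (the "Dept"/"Name" keys and the `Row` case class here are simplified illustrations, not the actual employee record): a fold carries the current section's state across lines and emits a record whenever one becomes complete.

```scala
// Toy stateful line parser: "Dept : X" updates the carried state,
// "Name : Y" completes a record, anything else leaves state untouched.
case class Row(dept: String, name: String)

def parseRows(lines: Seq[String]): Seq[Row] =
  lines.foldLeft((Option.empty[String], Vector.empty[Row])) {
    case ((dept, out), line) =>
      line.split(":", 2).map(_.trim) match {
        case Array("Dept", d) => (Some(d), out)                              // new section: replace carried dept
        case Array("Name", n) => (dept, dept.fold(out)(d => out :+ Row(d, n))) // record complete: emit it
        case _                => (dept, out)                                 // unrelated line: no change
      }
  }._2
```

The same shape scales to the real format below the fold: more keys in the state, and a record is emitted only once every required field is present.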
Here's a working example of how to deal with it, using an iterator in plain Scala:
case class EmployeeRecord(dept: String, name: String, location: String, dob: String, project: String, joined: String, left: String) {
def toCSV = this.productIterator.mkString(", ")
}
class EmployeeParser() {
var currentStack : Map[String, String] = Map()
val (dept, name, location, birthdate, project, joined, left) = ("dept", "name", "location", "birthdate", "project", "joined", "left")
val keySequence = Seq(dept, name, location, birthdate, project, joined, left)
val ParseKeys = Map("Project name" -> project, "DOJ" -> joined, "DOL" -> left, "DOB" -> birthdate, "Dept" -> dept)
val keySet = keySequence.toSet
// Drop `key` and every field after it from the state, keeping earlier context.
def clearDependencies(key: String) : Unit = {
val clearKeys = keySequence.dropWhile(k => k != key).toSet
currentStack = currentStack.filterKeys(k => !clearKeys.contains(k))
}
// A key may only be set once all keys before it in the sequence are present
// (the leading "dept" is exempt, hence the drop(1)).
def isValidEntry(key: String) : Boolean = {
val precedents = keySequence.takeWhile(k => k != key).drop(1)
precedents.forall(k => currentStack.contains(k))
}
def add(key:String, value:String): Option[Unit] = {
if (!isValidEntry(key)) None else {
clearDependencies(key)
currentStack = currentStack + (key -> value)
Some(())
}
}
def record: Option[EmployeeRecord] =
for {
_dept <- currentStack.get(dept)
_name <- currentStack.get(name)
_location <- currentStack.get(location)
_dob <- currentStack.get(birthdate)
_project <- currentStack.get(project)
_joined <- currentStack.get(joined)
_left <- currentStack.get(left)
} yield EmployeeRecord(_dept, _name, _location, _dob, _project,_joined, _left)
val EmpRegex = "^Emp name is (.*) lives in (.*)$".r
def parse(line:String):Option[EmployeeRecord] = {
if (line.startsWith("Emp")) { // "Emp name is ..." doesn't follow the "key : value" shape, so handle it separately
line match {
case EmpRegex(n, l) => add(name, n); add(location, l)
case _ => () // malformed employee line: ignore it
}
None
} else {
val entry = line.split(":", 2).map(_.trim)
for { entryKey <- entry.lift(0)
entryValue <- entry.lift(1)
key <- ParseKeys.get(entryKey)
_ <- add(key, entryValue)
rec <- record
} yield rec
}
}
}
To use it, we instantiate the parser and apply it to an iterator:
import scala.io.Source

val iterator = Source.fromFile(...).getLines
val parser = new EmployeeParser()
val parsedRecords = iterator.map(parser.parse).collect{case Some(record) => record}
val parsedCSV = parsedRecords.map(rec => rec.toCSV)
parsedCSV.foreach { line => ??? /* write `line` to the destination file */ }
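The write step is left open above; as one possible way to fill it in, here is a small sketch using `java.nio` (the `writeCsv` helper and the file path are illustrative, not part of the original answer; `asJava` needs Scala 2.13's `CollectionConverters`):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Illustrative helper: write the CSV lines produced by the parser to a file.
def writeCsv(path: String, lines: Seq[String]): Unit =
  Files.write(Paths.get(path), lines.asJava)
```

With it, the last step could read `writeCsv("employees.csv", parsedCSV.toSeq)` (note the `toSeq`: `parsedCSV` is a lazy iterator and is only consumed at this point).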