Scala function to explode data in a particular array to extract columns

Time: 2018-05-08 05:02:11

Tags: scala apache-spark apache-spark-sql spark-dataframe user-defined-functions

Sample row:

  

{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]

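I want to explode the row above into separate rows and extract the columns listed below. I am using Spark SQL explode and expect output like the following, where the centralid and centralloc columns are repeated for each row in components: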

centralid, centralloc, components.id , components.rapidViewId, components.state, components.name, components.startDate, components.endDate, components.completeDate, components.sequence

Please share your thoughts. If this can be done with a regular expression, please let me know how.

Edited to match the requirements.

3 answers:

Answer 0 (score: 1)

Here is one approach you can try. Assume you have the output as a list of Strings.

val input = List("com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]",

  "com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]",

  "com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]",

  "com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]")

You can declare a case class in Scala to extract certain fields from it.

case class Output(id:String,rapidViewId:String, state:String, name:String, startDate:String, endDate:String, completeDate:String, sequence:String)

You can now build the final list of case class instances, from which you can extract the required fields.

val result = input.map{
  x =>
    val intermediateResult = x.split("\\[")(1).split("\\,")
    Output(intermediateResult(0),intermediateResult(1),intermediateResult(2),intermediateResult(3),intermediateResult(4),intermediateResult(5),intermediateResult(6),intermediateResult(7).replaceAll("\\]",""))

}

You will get the result in the following format:
result: List[Output] = List(Output(id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - "abc",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980), Output(id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - "abc",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974), Output(id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - "xyz",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990), Output(id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - "lmnop",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999))

You can extract the fields you need from it.
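Note that with the split above each field still carries its key, for example id=123. If only the values are wanted, a minimal sketch could look like the following. The names SprintFields, parseSprint and sample are hypothetical (not part of the original answer), and the sketch assumes that no value contains a comma.

```scala
// Hypothetical variant of the Output case class that holds values only.
case class SprintFields(id: String, rapidViewId: String, state: String, name: String,
                        startDate: String, endDate: String, completeDate: String, sequence: String)

// One entry in the same shape as the input list above.
val sample = List(
  "com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]"
)

def parseSprint(s: String): SprintFields = {
  // Text between the first '[' and the last ']'.
  val body = s.substring(s.indexOf('[') + 1, s.lastIndexOf(']'))
  // Split each pair on the first '=' only, so a '=' inside a value is kept.
  val kv = body.split(",").map { pair =>
    val parts = pair.split("=", 2)
    parts(0) -> parts(1)
  }.toMap
  SprintFields(kv("id"), kv("rapidViewId"), kv("state"), kv("name"),
    kv("startDate"), kv("endDate"), kv("completeDate"), kv("sequence"))
}

val values = sample.map(parseSprint)
```

Looking fields up by key rather than by position means the result does not depend on the order of the pairs inside the brackets.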

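This is one of the approaches you can use. If you have any questions, let me know and I will be happy to clarify.

Answer 1 (score: 1)

Here is some code for your reference; you may want to use a regular expression to improve it: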

val c = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

c.split(",").
  flatMap(_.split('[')).
  flatMap(_.split(']')).
  filter(_.contains('=')).
  map(_.split('=')(1)).
  grouped(8).
  map(_.toList).
  foreach(println)

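The output is as follows:

List(123, 321, CLOSED, Sprint 30 - \"abc\", 2018-03-09T16:04:40.666+11:00, 2018-03-23T16:04:00.000+11:00, 2018-03-23T14:12:44.680+11:00, 980)
List(456, 654, CLOSED, Sprint 31 - \"abc\", 2018-03-23T14:57:17.889+11:00, 2018-04-06T14:57:00.000+10:00, 2018-04-06T15:05:27.638+10:00, 974)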
List(789, 987, CLOSED, Sprint 32 - \"xyz\", 2018-04-06T15:43:52.118+10:00, 2018-04-20T15:43:00.000+10:00, 2018-04-20T14:06:26.892+10:00, 990)
List(101, 101, CLOSED, Sprint 33 - \"lmnop\", 2018-04-20T15:54:01.418+10:00, 2018-05-04T15:54:00.000+10:00, 2018-05-04T15:06:45.374+10:00, 999)
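As a possible regex-based refinement of the chain of splits above, the following sketch uses two patterns: one to find each bracketed entry and one to pull out the key=value pairs inside it. The names raw, entry, pair and rows are made up for illustration, the input is shortened to the first two entries, and the sketch assumes that values contain neither ',' nor ']'.

```scala
import scala.util.matching.Regex

// First two entries of the string `c` above, kept short for readability.
val raw = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]"]"""

// One match per "Sprint@<hash>[...]" entry; group 1 is the text inside the brackets.
val entry: Regex = """Sprint@\w+\[([^\]]*)\]""".r
// One match per key=value pair; group 1 is the key, group 2 the value up to the next comma.
val pair: Regex = """(\w+)=([^,]*)""".r

// One Map per entry, keyed by field name.
val rows: List[Map[String, String]] =
  entry.findAllMatchIn(raw).map { m =>
    pair.findAllMatchIn(m.group(1)).map(p => p.group(1) -> p.group(2)).toMap
  }.toList

rows.foreach(println)
```

Keeping the keys makes it possible to select fields by name, for example rows.map(_("id")), instead of relying on groups of eight positional values.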

Answer 2 (score: 1)

See the comments in the code for an explanation.

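//imports required by the code below; sc and sqlContext are assumed to be the ones provided by spark-shell
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
//string definition as in the question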
val str = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
//parsing above string to get line by line data
val parsed = str.split("\",\"").map(line => line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", ""))
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).split(",").map(field => StructField(field.split("=")(0), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.split(",").map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

This should give you:

+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|id |rapidViewId|state |name                 |startDate                    |endDate                      |completeDate                 |sequence|
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|123|321        |CLOSED|Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980     |
|456|654        |CLOSED|Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974     |
|789|987        |CLOSED|Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990     |
|101|101        |CLOSED|Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999     |
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+

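Update

Since you have updated the question with a new input string and a new title, the approach proposed above needs a few adjustments, as follows: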

//string definition as in the question
val str = """{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

val initialParsing = str.split(":\\[")
//parsing above string to get line by line data
val parsed = initialParsing(1).split("\",\"").map(line => {
  val initialSplitted = initialParsing(0).split(",")
  Seq(initialSplitted(2).replace(":", "="), initialSplitted(3).replace(":", "=")) ++ line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", "").split(",").map(initialSplitted(4)+"."+_)
})
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).map(field => StructField(field.split("=")(0).replace("\"", ""), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

This should give you:

+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|centralid|centralloc|components.id|components.rapidViewId|components.state|components.name      |components.startDate         |components.endDate           |components.completeDate      |components.sequence|
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|10       |"balh"    |123          |321                   |CLOSED          |Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980                |
|10       |"balh"    |456          |654                   |CLOSED          |Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974                |
|10       |"balh"    |789          |987                   |CLOSED          |Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990                |
|10       |"balh"    |101          |101                   |CLOSED          |Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999                |
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
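If the row is already available as a DataFrame in which components is an array of strings, the same result can be sketched with Spark's built-in explode and regexp_extract functions instead of manual string splitting. This is only an illustration under stated assumptions: the names sprints, fieldNames, exploded and flattened are made up, the local SparkSession is created just for the example, the data is shortened to two entries, and the output columns use an underscore (components_id) rather than a dot because dotted column names need backtick quoting in Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, regexp_extract}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Assumption: the row has already been loaded with `components` as array<string>.
val sprints = Seq(
  (10, "balh", Seq(
    "com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]",
    "com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]"))
).toDF("centralid", "centralloc", "components")

val fieldNames = Seq("id", "rapidViewId", "state", "name", "startDate", "endDate", "completeDate", "sequence")

// One output row per array element; centralid and centralloc are repeated on each row.
val exploded = sprints.withColumn("component", explode(col("components")))

// For each field, capture the text after "<field>=" up to the next ',' or ']'.
// The leading [\[,] keeps "id=" from matching inside "rapidViewId=".
val flattened = fieldNames.foldLeft(exploded) { (acc, f) =>
  acc.withColumn(s"components_$f", regexp_extract(col("component"), s"[\\[,]$f=([^,\\]]*)", 1))
}.drop("components", "component")

flattened.show(false)
```

Like the string-splitting version, this assumes that no value contains a comma or a closing bracket.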

I hope this answer helps guide you through the rest of the work.