示例行:
{"扩大":"名称,模式"" centralid":10," centralloc":" balh&# 34;,组分:" com.atlassian.greenhopper.service.sprint.Sprint@89322d3 [ID = 123,rapidViewId = 321,状态= CLOSED,姓名=冲刺 30 - \" ABC \",的startDate = 2018-03-09T16:04:40.666 + 11:00,结束日期= 2018-03-23T16:04:00.000 + 11:00,completeDate = 2018-03- 23T14:12:44.680 + 11:00,序列= 980]"" com.atlassian.greenhopper.service.sprint.Sprint@42e71215 [ID = 456,rapidViewId = 654,状态= CLOSED,名称=冲刺 31 - \" ABC \",的startDate = 2018-03-23T14:57:17.889 + 11:00,结束日期= 2018-04-06T14:57:00.000 + 10:00,completeDate = 2018-04- 06T15:05:27.638 + 10:00,序列= 974]"" com.atlassian.greenhopper.service.sprint.Sprint@226753d [ID = 789,rapidViewId = 987,状态= CLOSED,名称=冲刺 32 - \" XYZ \",的startDate = 2018-04-06T15:43:52.118 + 10:00,结束日期= 2018-04-20T15:43:00.000 + 10:00,completeDate = 2018-04- 20T14:06:26.892 + 10:00,序列= 990]"" com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de [ID = 101,rapidViewId = 101,状态= CLOSED,名称=冲刺 33 - \" lmnop \",的startDate = 2018-04-20T15:54:01.418 + 10:00,结束日期= 2018-05-04T15:54:00.000 + 10:00,completeDate = 2018-05- 04T15:06:45.374 + 10:00,序列= 999]"]
我希望在线上爆炸成为单独的行并在列下面提取。我正在使用sparksql爆炸并获得如下输出。将对组件中的每一行重复centralid,centralloc列
centralid, centralloc, components.id , components.rapidViewId, components.state, components.name, components.startDate, components.endDate, components.completeDate, components.sequence
请分享您的想法。你可以尝试使用正则表达式,并让我知道如何
编辑以满足要求。
答案 0 :(得分:1)
这是您可以尝试的方法之一。 假设您将输出作为String列表。
val input = List("com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]",
"com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]",
"com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]",
"com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]")
您可以在scala中声明一个case类来从中提取certail字段。
case class Output(id:String,rapidViewId:String, state:String, name:String, startDate:String, endDate:String, completeDate:String, sequence:String)
您现在可以获取最终案例类列表,您可以从中提取所需的字段。
val result = input.map{
x =>
val intermediateResult = x.split("\\[")(1).split("\\,")
Output(intermediateResult(0),intermediateResult(1),intermediateResult(2),intermediateResult(3),intermediateResult(4),intermediateResult(5),intermediateResult(6),intermediateResult(7).replaceAll("\\]",""))
}
您将以
格式获得结果result: List[Output] = List(Output(id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - "abc",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980), Output(id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - "abc",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974), Output(id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - "xyz",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990), Output(id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - "lmnop",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999))
你可以从中提取你需要的人。
这是您可以使用的方法之一。如果您有任何疑问,请告诉我。我很乐意澄清它们。
答案 1 :(得分:1)
以下是一些供您参考的代码,您可能希望使用正则表达式来改进它:
val c = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
c.split(",").
flatMap(_.split('[')).
flatMap(_.split(']')).
filter(_.contains('=')).
map(_.split('=')(1)).
grouped(8).
map(_.toList).
foreach(println)
输出如下:
List(456, 654, CLOSED, Sprint 31 - \"abc\", 2018-03-23T14:57:17.889+11:00, 2018-04-06T14:57:00.000+10:00, 2018-04-06T15:05:27.638+10:00, 974)
List(789, 987, CLOSED, Sprint 32 - \"xyz\", 2018-04-06T15:43:52.118+10:00, 2018-04-20T15:43:00.000+10:00, 2018-04-20T14:06:26.892+10:00, 990)
List(101, 101, CLOSED, Sprint 33 - \"lmnop\", 2018-04-20T15:54:01.418+10:00, 2018-05-04T15:54:00.000+10:00, 2018-05-04T15:06:45.374+10:00, 999)
答案 2 :(得分:1)
请查看评论以获得解释。
//string definition as in the question
val str = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
//parsing above string to get line by line data
val parsed = str.split("\",\"").map(line => line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", ""))
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).split(",").map(field => StructField(field.split("=")(0), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.split(",").map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
应该给你
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|id |rapidViewId|state |name |startDate |endDate |completeDate |sequence|
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|123|321 |CLOSED|Sprint 30 - \"abc\" |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980 |
|456|654 |CLOSED|Sprint 31 - \"abc\" |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974 |
|789|987 |CLOSED|Sprint 32 - \"xyz\" |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990 |
|101|101 |CLOSED|Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999 |
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
由于您使用新输入字符串并使用新标题更新了问题,因此您必须调整一些更改以上提议的指南可以如下
//string definition as in the question
val str = """{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
val initialParsing = str.split(":\\[")
//parsing above string to get line by line data
val parsed = initialParsing(1).split("\",\"").map(line => {
val initialSplitted = initialParsing(0).split(",")
Seq(initialSplitted(2).replace(":", "="), initialSplitted(3).replace(":", "=")) ++ line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", "").split(",").map(initialSplitted(4)+"."+_)
})
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).map(field => StructField(field.split("=")(0).replace("\"", ""), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
应该给你
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|centralid|centralloc|components.id|components.rapidViewId|components.state|components.name |components.startDate |components.endDate |components.completeDate |components.sequence|
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|10 |"balh" |123 |321 |CLOSED |Sprint 30 - \"abc\" |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980 |
|10 |"balh" |456 |654 |CLOSED |Sprint 31 - \"abc\" |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974 |
|10 |"balh" |789 |987 |CLOSED |Sprint 32 - \"xyz\" |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990 |
|10 |"balh" |101 |101 |CLOSED |Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999 |
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
我希望答案有助于指导您完成其余的工作。