I have a set of data that looks like this:
1:a:x|y|z
2:b:y|z
3:c:x
4:d:w|x
What I want is output that looks like this:
1,a,x
1,a,y
1,a,z
2,b,y
2,b,z
3,c,x
4,d,w
4,d,x
I tried splitting on ':' and '|', but that didn't help, because it flattens every field of a line into a single row:
1,a,x,y,z
2,b,y,z
3,c,x
4,d,w,x
Also, is there any way to filter unwanted values out of the RDD? For example, if I filter (w, y, z) out of:
1,a,x,y,z
2,b,y,z
3,c,x
4,d,w,x
The expected output would be:
1,a,x
2,b, //it'll be fine if this doesn't even appear, better in fact
3,c,x
4,d,x
Any ideas?
Answer 0 (score: 0)
I'm assuming that the only column that can contain multiple options is the last one:
rdd.flatMap {
row =>
val Array(col1, col2, col3) = row.split(':')
col3.split('|').map(value => (col1, col2, value) )
}
After that, you will have an RDD[(String, String, String)] with one element per value of the last column.
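The answer above does not cover the second part of the question (dropping unwanted values). A minimal sketch of one way to do it follows, assuming a plain Scala Seq stands in for the RDD (RDD also provides flatMap and filter) and assuming the unwanted values are w, y and z as in the question. The names rows, unwanted, exploded and filtered are illustrative, not from the original.

```scala
// A plain Seq stands in for the RDD[String] from the question (assumption).
val rows = Seq("1:a:x|y|z", "2:b:y|z", "3:c:x", "4:d:w|x")

// Hypothetical set of values to drop, taken from the question's example.
val unwanted = Set("w", "y", "z")

// One (col1, col2, value) triple per value in the last column.
val exploded = rows.flatMap { row =>
  val Array(col1, col2, col3) = row.split(':')
  col3.split('|').map(value => (col1, col2, value))
}

// Keep only the triples whose third field is not in the unwanted set.
val filtered = exploded.filterNot { case (_, _, value) => unwanted(value) }

filtered.foreach { case (c1, c2, v) => println(s"$c1,$c2,$v") }
```

Because the filter runs after the rows are exploded, a line whose values are all unwanted (such as 2:b:y|z) simply produces no output rows, which matches the preferred behaviour described in the question.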
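Answer 1 (score: 0)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for toDF and the $"..." syntax; assumes a SparkSession named spark
val df = sc.parallelize(Seq(("1:a:x|y|z"), ("2:b:y|z"), ("3:c:x"),("4:d:w|x")))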
df.collect
//flat map
val df1=df.flatMap {
row =>
val Array(col1, col2, col3) = row.split(':')
col3.split('|').map(value => (col1, col2, value) )
}
df1.collect
//filter as per requirements
val df2=df1.toDF("col1","col2","col3")
df2.show()
//df2.createOrReplaceTempView("TempTable")
//val countDF = spark.sqlContext.sql("SELECT col1,col2,col3, MIN(col1) FROM TempTable GROUP BY col1,col2,col3").show()
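// keeps one row per col1; ordering by the partition key itself leaves the choice among ties unspecified
val w = Window.partitionBy($"col1").orderBy($"col1".desc)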
val dfTop = df2.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").orderBy($"col1".asc)
dfTop.show
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| x|
| 1| a| y|
| 1| a| z|
| 2| b| y|
| 2| b| z|
| 3| c| x|
| 4| d| w|
| 4| d| x|
+----+----+----+
After filtering:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| x|
| 2| b| y|
| 3| c| x|
| 4| d| w|
+----+----+----+
df: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[88] at parallelize at command-4102047427741428:3
df1: org.apache.spark.rdd.RDD[(String, String, String)] = MapPartitionsRDD[89] at flatMap at command-4102047427741428:6
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@31cd5345
dfTop: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [col1: string, col2: string ... 1 more field]
Command took 1.01 seconds -- by vaquar.khan@gmail.com at 11/11/2017, 10:35:31 PM on My Cluster
Answer 2 (score: 0)
The first step is to split each line:
val in = "1:a:x|y|z".split(':').toList.map(_.split('|').toList)
----
in: List[List[String]] = List(List(1), List(a), List(x, y, z))
Then generate the combinations:
def combine(input: List[List[String]]): List[List[String]] = input match {
  case x :: xs => x.flatMap(s => combine(xs).map(s :: _))
  case Nil => List(Nil)
}
val res = combine(in).map(_.mkString(","))
----
res: List[String] = List(1,a,x, 1,a,y, 1,a,z)
Then, if needed, you can apply .filter to the result.
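The filtering step is only named in this answer, not shown. A minimal sketch of how it might look is below, assuming the values to drop are w, y and z as in the question and that they only need to be checked in the last field. The helper expandAndFilter and the names lines and result are hypothetical and not part of the original answer.

```scala
// Cartesian product of the per-column options, as defined in the answer.
def combine(input: List[List[String]]): List[List[String]] = input match {
  case x :: xs => x.flatMap(s => combine(xs).map(s :: _))
  case Nil     => List(Nil)
}

// Hypothetical helper: expand one line, then drop rows whose last field is unwanted.
def expandAndFilter(line: String, unwanted: Set[String]): List[String] = {
  val columns = line.split(':').toList.map(_.split('|').toList)
  combine(columns)
    .filterNot(fields => unwanted(fields.last))
    .map(_.mkString(","))
}

val lines  = List("1:a:x|y|z", "2:b:y|z", "3:c:x", "4:d:w|x")
val result = lines.flatMap(expandAndFilter(_, Set("w", "y", "z")))

result.foreach(println)
```

Unlike the flatMap over the last column only, combine expands every column that contains '|', so this variant also handles lines where an earlier column has multiple options.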