Splitting on two different delimiters

Date: 2017-11-11 20:51:22

Tags: scala

I have a set of data that looks like this:

1:a:x|y|z
2:b:y|z
3:c:x
4:d:w|x

What I want is output that looks like this:

1,a,x
1,a,y
1,a,z
2,b,y
2,b,z
3,c,x
4,d,w
4,d,x

I have tried splitting on ':' and '|', but that does not help, because it gives a result like this:

1,a,x,y,z
2,b,y,z
3,c,x
4,d,w,x
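
A minimal sketch of the kind of split that produces this flattened result (assuming a single regex split over both delimiters):

val flattened = "1:a:x|y|z".split("[:|]").mkString(",")
// flattened: String = 1,a,x,y,z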

Also, is there a way I can filter unwanted values out of the RDD? Say I filter out (w, y, z) from:

1,a,x,y,z
2,b,y,z
3,c,x
4,d,w,x

The expected output would be as follows:

1,a,x
2,b,     //it'll be fine if this doesn't even appear, better in fact
3,c,x
4,d,x

Any ideas?

3 Answers:

Answer 0 (score: 0)

I am assuming that the only column which can contain multiple options is the last one:

rdd.flatMap {
  row =>
    // split on ':' and explode the last field on '|'
    val Array(col1, col2, col3) = row.split(':')
    col3.split('|').map(value => (col1, col2, value))
}

After that, you will have an RDD[(String, String, String)].
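
For the filtering mentioned in the question, a minimal sketch (not part of the original answer, and assuming the result of the flatMap above is bound to a val named parsed) would be to drop tuples whose last element is unwanted:

val unwanted = Set("w", "y", "z")  // values the question wants removed (assumption)
// parsed is a hypothetical name for the RDD[(String, String, String)] produced above
val filtered = parsed.filter { case (_, _, value) => !unwanted.contains(value) }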

Answer 1 (score: 0)

// imports needed for toDF, the $-syntax, Window and row_number
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = sc.parallelize(Seq("1:a:x|y|z", "2:b:y|z", "3:c:x", "4:d:w|x"))
df.collect

// flatMap: split on ':' and explode the last field on '|'
val df1 = df.flatMap {
  row =>
    val Array(col1, col2, col3) = row.split(':')
    col3.split('|').map(value => (col1, col2, value))
}

df1.collect

// convert to a DataFrame so it can be filtered as per requirements
val df2 = df1.toDF("col1", "col2", "col3")
df2.show()
// df2.createOrReplaceTempView("TempTable")
// val countDF = spark.sqlContext.sql("SELECT col1, col2, col3, MIN(col1) FROM TempTable GROUP BY col1, col2, col3").show()

// keep only the first row per col1 value
val w = Window.partitionBy($"col1").orderBy($"col1".desc)

val dfTop = df2.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").orderBy($"col1".asc)

dfTop.show

Result:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|   x|
|   1|   a|   y|
|   1|   a|   z|
|   2|   b|   y|
|   2|   b|   z|
|   3|   c|   x|
|   4|   d|   w|
|   4|   d|   x|
+----+----+----+

After filtering:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|   x|
|   2|   b|   y|
|   3|   c|   x|
|   4|   d|   w|
+----+----+----+

df: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[88] at parallelize at command-4102047427741428:3
df1: org.apache.spark.rdd.RDD[(String, String, String)] = MapPartitionsRDD[89] at flatMap at command-4102047427741428:6
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
w: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@31cd5345
dfTop: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [col1: string, col2: string ... 1 more field]
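
For the filtering the question asks about (dropping w, y and z), a simpler alternative to the window step could be a plain column filter. This is only a sketch, not part of the original answer, and reuses the df2 DataFrame built above:

// keep only rows whose col3 is not one of the unwanted values;
// a key with no remaining values (e.g. 2,b) simply disappears
val dfFiltered = df2.filter(!$"col3".isin("w", "y", "z"))
dfFiltered.show()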

Answer 2 (score: 0)

The first step is to split each row:

val in = "1:a:x|y|z".split(':').toList.map(_.split('|').toList)
----
in: List[List[String]] = List(List(1), List(a), List(x, y, z))

Then generate the combinations:

def combine(input: List[List[String]]): List[List[String]] = input match {
  case x :: xs => x.flatMap(s => combine(xs).map(s :: _))
  case Nil => List(Nil)
}

val res = combine(in).map(_.mkString(","))
----
res: List[String] = List(1,a,x, 1,a,y, 1,a,z)

Then, if needed, you can .filter out the unwanted values.
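
A minimal sketch of that filter step, assuming the unwanted values from the question (w, y, z) and the in / combine definitions above:

val unwanted = Set("w", "y", "z")  // values to drop (assumption based on the question)
// drop combinations whose last element is unwanted, then format each row as CSV
val kept = combine(in)
  .filter(row => row.lastOption.forall(v => !unwanted.contains(v)))
  .map(_.mkString(","))
// kept: List[String] = List(1,a,x)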