Change selected rows into columns

Date: 2019-10-07 09:39:11

Tags: regex scala dataframe apache-spark pivot

I have a dataframe with the following structure:

+------+-------------+--------+
|region|          key|     val|
+------+-------------+--------+
|Sample|row1         |       6|
|Sample|row1_category|   Cat 1|
|Sample|row1_Unit    |      Kg|
|Sample|row2         |       4|
|Sample|row2_category|   Cat 2|
|Sample|row2_Unit    |     ltr|
+------+-------------+--------+

I tried adding a column and pushing the values up from the rows into columns, but I could not work out the Category and Unit columns.

I want to convert it into the following structure:

+------+-------------+--------+--------+--------+
|region|          key|     val|Category|   Unit |
+------+-------------+--------+--------+--------+
|Sample|row1         |       6|   Cat 1|      Kg|
|Sample|row2         |       4|   Cat 2|     ltr|
+------+-------------+--------+--------+--------+

I need to do this for multiple keys; I will have row2, row3, and so on.

2 answers:

Answer 0 (score: 2)

scala> df.show
+------+-------------+----+
|region|          key| val|
+------+-------------+----+
|Sample|         row1|   6|
|Sample|row1_category|Cat1|
|Sample|    row1_Unit|  Kg|
|Sample|         row2|   4|
|Sample|row2_category|Cat2|
|Sample|    row2_Unit| ltr|
+------+-------------+----+


scala> import org.apache.spark.sql.functions._

scala> // split key on "_": item 0 is the base row id, item 1 is the optional suffix
scala> val df1 = df.withColumn("_temp", split($"key", "_")).select(col("region"), $"_temp".getItem(0) as "key", $"_temp".getItem(1) as "colType", col("val"))


scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null    |6   |
|Sample|row1|category|Cat1|
|Sample|row1|Unit    |Kg  |
|Sample|row2|null    |4   |
|Sample|row2|category|Cat2|
|Sample|row2|Unit    |ltr |
+------+----+--------+----+


scala> // route val into Category/Unit depending on the suffix; keep val only for the base rows
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))


scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null    |6   |null    |null|
|Sample|row1|category|null|Cat1    |null|
|Sample|row1|Unit    |null|null    |Kg  |
|Sample|row2|null    |4   |null    |null|
|Sample|row2|category|null|Cat2    |null|
|Sample|row2|Unit    |null|null    |ltr |
+------+----+--------+----+--------+----+


scala> // collapse the three rows per key into one, picking the single non-null value for each column
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("", collect_set(when($"val".isNotNull, $"val"))).as("val"), concat_ws("", collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("", collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))


scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1|  6|    Cat1|  Kg|
|Sample|row2|  4|    Cat2| ltr|
+------+----+---+--------+----+
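
A minimal alternative sketch, not part of the original answer, assuming Spark 2.x or later and the same df as above: after the same split step, the built-in pivot with an explicit value list can replace the manual when/collect_set bookkeeping.

scala> // label the suffix-less rows "val" so they pivot into their own column
scala> val pivoted = df.withColumn("_temp", split($"key", "_")).select($"region", $"_temp".getItem(0).as("key"), coalesce($"_temp".getItem(1), lit("val")).as("colType"), $"val").groupBy("region", "key").pivot("colType", Seq("val", "category", "Unit")).agg(first("val")).withColumnRenamed("category", "Category")

scala> pivoted.show(false)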

Answer 1 (score: 1)

You can achieve this by grouping by the key (and possibly the region) and aggregating with collect_list. With the regex ^[^_]+ you get all the characters up to the _ character.

Update: you can use the (\\d{1,}) regex to find all the numbers in a string (as a capture group). For example, if you have row_123_456_unit and the function looks like regexp_extract('val, "(\\d{1,})", 0), you will get 123; if you change the last parameter to 1, you will get 456. Hope it helps. test regex
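
A minimal illustration of the group-index argument (the sample row and the two-group pattern are hypothetical, not from the answer; it assumes spark.implicits._ and org.apache.spark.sql.functions._ are in scope). regexp_extract returns the requested capture group of the first match, so a pattern that captures both numbers makes the choice between 123 and 456 explicit:

  // hypothetical sample row, not from the question's data
  val nums = Seq("row_123_456_unit").toDF("key")

  nums.select(
    regexp_extract('key, "(\\d+)_(\\d+)", 1).as("first_num"),   // "123"
    regexp_extract('key, "(\\d+)_(\\d+)", 2).as("second_num")   // "456"
  ).show()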

Output:

  df.printSchema()
  df.show()

  val regex1 = "^[^_]+"     // everything up to the first '_' character
  val regex2 = "(\\d{1,})"  // capture group of numbers (see the update above)

  // group rows sharing the same key prefix, then spread the collected
  // values over the val/Category/Unit columns
  df.groupBy('region, regexp_extract('key, regex1, 0))
    .agg(collect_list('key).as("key"), collect_list('val).as("val"))
    .select('region,
      'key.getItem(0).as("key"),
      'val.getItem(0).as("val"),
      'val.getItem(1).as("Category"),
      'val.getItem(2).as("Unit")
    ).show()
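
Note that the getItem calls above assume collect_list returns the values in their original row order, which Spark does not guarantee once the data is shuffled. A minimal order-independent sketch, not from the original answer, assuming the same df and imports as above:

  // map each value to its target column by suffix instead of by position
  val safe = df
    .withColumn("base", regexp_extract('key, "^[^_]+", 0))
    .withColumn("suffix", regexp_extract('key, "_(.*)$", 1))  // "" for the plain rows
    .groupBy("region", "base")
    .agg(
      first(when('suffix === "", 'val), ignoreNulls = true).as("val"),
      first(when('suffix === "category", 'val), ignoreNulls = true).as("Category"),
      first(when('suffix === "Unit", 'val), ignoreNulls = true).as("Unit"))
    .withColumnRenamed("base", "key")

  safe.show()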