Question

Suppose I have the following Dataset:

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|      XX-13|       300|
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|     XX-100|       870|
+-----------+----------+

Here productCode is of type String and amount is an Int.

If I sort by productCode, I get the following result, which is expected given how String comparison works:

def orderProducts(product: Dataset[Product]): Dataset[Product] = {
    product.orderBy("productCode")
}

// Output:
+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|      XX-10|        35|
|     XX-100|       870|
|      XX-13|       300|
|       XX-2|       410|
|       XX-9|        50|
+-----------+----------+
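
The ordering above is ordinary lexicographic string comparison, and it can be reproduced without Spark. The following is a minimal sketch in plain Scala; the object name SuffixOrdering is hypothetical, and it assumes every code has the form prefix, hyphen, integer, as in the sample data.

```scala
object SuffixOrdering {
  val codes: List[String] =
    List("XX-13", "XX-1", "XX-2", "XX-9", "XX-10", "XX-100")

  // Lexicographic ordering compares character by character,
  // so "XX-10" sorts before "XX-2" because '1' < '2'.
  val lexicographic: List[String] = codes.sorted

  // Numeric ordering: parse the part after the hyphen as an Int
  // and sort on that value instead of on the whole string.
  val numeric: List[String] = codes.sortBy(code => code.split("-")(1).toInt)

  def main(args: Array[String]): Unit = {
    println(lexicographic.mkString(", ")) // XX-1, XX-10, XX-100, XX-13, XX-2, XX-9
    println(numeric.mkString(", "))       // XX-1, XX-2, XX-9, XX-10, XX-13, XX-100
  }
}
```

The same idea, sorting on a derived integer key rather than on the raw string, is what a Spark solution needs to express as a column expression.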

How can I get the output ordered by the integer part of productCode, using the Dataset API?


Answer 1

Use an expression in orderBy. Check this out:

scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]

scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
|       XX-1|250|
|       XX-2|410|
|       XX-9| 50|
|      XX-10| 35|
|      XX-13|300|
|     XX-100|870|
+-----------+---+
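
The same expression can be dropped into the typed function from the question. This is a sketch rather than a definitive implementation: the Product case class is assumed to match the question's schema, the object name OrderProductsSketch and the local SparkSession setup are illustrative, and the code assumes every productCode contains a hyphen followed by digits (a suffix that cannot be cast becomes null).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, split}

// Assumed shape of the question's Product type.
case class Product(productCode: String, amount: Int)

object OrderProductsSketch {
  // Sort on the integer after the hyphen; the result is still a Dataset[Product].
  def orderProducts(products: Dataset[Product]): Dataset[Product] =
    products.orderBy(split(col("productCode"), "-").getItem(1).cast("int"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("order-products-sketch")
      .getOrCreate()
    import spark.implicits._

    val products = Seq(
      Product("XX-13", 300), Product("XX-1", 250), Product("XX-2", 410),
      Product("XX-9", 50), Product("XX-10", 35), Product("XX-100", 870)
    ).toDS()

    orderProducts(products).show()
    spark.stop()
  }
}
```

Because orderBy with a Column argument returns a Dataset of the same element type, the function keeps the signature used in the question.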



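With a window function (this needs import org.apache.spark.sql.expressions.Window), you can also number the rows in that order: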

scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1       |250|1   |
|XX-2       |410|2   |
|XX-9       |50 |3   |
|XX-10      |35 |4   |
|XX-13      |300|5   |
|XX-100     |870|6   |
+-----------+---+----+



Note that Spark warns about moving all the data to a single partition, because the window defines no partitioning.
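
If a global row number is not required, one way to avoid the single-partition window is to add a partitionBy. The sketch below partitions by the code prefix (the text before the hyphen), which is an assumption about the data and not part of the original answer; the object name and the helper withRowNumbers are hypothetical. With this change the row numbers restart for each prefix, so it is only appropriate when per-prefix numbering is acceptable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, split}

object PartitionedWindowSketch {
  // Number rows within each prefix, ordered by the integer after the hyphen.
  def withRowNumbers(df: DataFrame): DataFrame = {
    val parts = split(col("productCode"), "-")
    val byPrefix = Window
      .partitionBy(parts.getItem(0))            // e.g. "XX"
      .orderBy(parts.getItem(1).cast("int"))    // e.g. 13
    df.withColumn("row1", row_number().over(byPrefix))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partitioned-window-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("XX-13", 300), ("XX-1", 250), ("XX-2", 410),
                 ("XX-9", 50), ("XX-10", 35), ("XX-100", 870))
      .toDF("productCode", "amt")

    withRowNumbers(df).show(false)
    spark.stop()
  }
}
```

In the sample data every code shares the prefix XX, so the numbering matches the unpartitioned version; with several prefixes each group would be numbered from 1.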

Sorting numeric strings in a Spark Dataset
