Question

Consider a Spark DataFrame df with the following schema:

root 
|-- date: timestamp (nullable = true) 
|-- customerID: string (nullable = true) 
|-- orderID: string (nullable = true) 
|-- productID: string (nullable = true)

One column should be cast to a different type, and the other columns should only have their whitespace trimmed.

df.select(
  $"date",
  df("customerID").cast(IntegerType),
  $"orderID",
  $"productID")
  .withColumn("orderID", trim(col("orderID")))
  .withColumn("productID", trim(col("productID")))

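These operations seem to require different syntax: the cast is done through select, while the trim is done through withColumn. I am used to R and dplyr, where all of the above would be handled in a single mutate call, so mixing select and withColumn feels a bit cumbersome.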

Is there a cleaner way to do this in a single pipeline?

Answer 1

df.select(
  $"date",
  $"customerID".cast(IntegerType),
  trim($"orderID").as("orderID"),
  trim($"productID").as("productID"))

Answer 2

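You can use either one. The difference is that withColumn adds a new column to the DataFrame (or replaces an existing column that has the same name), whereas select keeps only the columns you list. Choose whichever fits the situation.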
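To make that difference concrete, here is a minimal sketch. The local SparkSession settings, the object name and the sample row are assumptions made for illustration and are not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

object SelectVsWithColumn {
  // withColumn replaces orderID in place and keeps every other column.
  def viaWithColumn(df: DataFrame): DataFrame =
    df.withColumn("orderID", trim(col("orderID")))

  // select returns only the columns that are listed.
  def viaSelect(df: DataFrame): DataFrame =
    df.select(trim(col("orderID")).as("orderID"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-vs-withColumn")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row, not taken from the question.
    val df = Seq(("42", " o-1 ", " p-9 "))
      .toDF("customerID", "orderID", "productID")

    println(viaWithColumn(df).columns.mkString(", ")) // customerID, orderID, productID
    println(viaSelect(df).columns.mkString(", "))     // orderID

    spark.stop()
  }
}
```

So with select, every column you want to keep (such as date in the question) has to be listed explicitly, whereas with withColumn the untouched columns pass through on their own.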

The cast can be done with withColumn, as follows:

df.withColumn("customerID", $"customerID".cast(IntegerType))
  .withColumn("orderID", trim($"orderID"))
  .withColumn("productID", trim($"productID"))

Note that you do not need to call withColumn on the date column above, because it is left unchanged.

Likewise, trim can be applied inside select, as shown below; the alias keeps each column name unchanged:

df.select(
  $"date",
  $"customerID".cast(IntegerType),
  trim($"orderID").as("orderID"),
  trim($"productID").as("productID"))
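For anyone who wants to run the single-select version end to end, the following is a self-contained sketch. The imports are the ones the snippets above rely on; the object name, the local SparkSession settings and the sample row are assumptions for illustration. The explicit alias on the cast is an extra precaution so the output name does not depend on how Spark names cast expressions.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.IntegerType

object SinglePipeline {
  // One select that passes date through, casts customerID and trims the two ID columns.
  def clean(df: DataFrame): DataFrame =
    df.select(
      col("date"),
      col("customerID").cast(IntegerType).as("customerID"),
      trim(col("orderID")).as("orderID"),
      trim(col("productID")).as("productID"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("single-pipeline")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row matching the schema in the question.
    val df = Seq(
      (Timestamp.valueOf("2020-01-01 00:00:00"), "42", " o-1 ", " p-9 ")
    ).toDF("date", "customerID", "orderID", "productID")

    val cleaned = clean(df)
    cleaned.printSchema() // customerID should now be reported as integer
    cleaned.show()

    spark.stop()
  }
}
```

Inside spark-shell the session and `spark.implicits._` are already available, so only the body of `clean` is needed there.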

Multiple Spark DataFrame mutations in a single pipeline
