Using mapPartitions

Date: 2017-12-13 23:21:29

Tags: scala spark-dataframe apache-spark-dataset

I want to perform a typed transformation to replace all the values of certain columns on a Dataset. I know this can be done with select, but I want the full dataset back with the specific column values changed, not just the selected columns. I also know it is possible and easy with the withColumn method, but that is considered an untyped transformation. To do the same thing as a typed transformation and get the full dataset back, I am using mapPartitions, but I am running into problems:

case class Listing(street: String, zip: Int, price: Int)

val list = List(
  Listing("Main St", 92323, 30000),
  Listing("1st St", 94331, 10000),
  Listing("Sunset Ave", 98283, 50000))
val ds = sc.parallelize(list).toDS
val colNames = ds.columns

val newDS = ds.mapPartitions { iter =>
  for (row <- iter) yield {
    val newRow = for (i <- 0 until colNames.length) yield {
      if (some_condition) {
        // using reflection to get the field value, since the column to be
        // processed is only known dynamically, based on the if condition
        val value = row.getClass.getDeclaredMethod(colNames(i)).invoke(row).toString
        // send 'value' to some function for processing and return the new value
      } else {
        // just return the field value
        row.getClass.getDeclaredMethod(colNames(i)).invoke(row).toString
      }
    }
    newRow
  }
}

This gives me the following error:

error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.

I changed the last line to return newRow.as[Listing], which shows the error:

error: value as is not a member of scala.collection.immutable.IndexedSeq[String]

This tells me that no Listing object is being returned, just a collection of Strings.

Is this the right way to get the full dataset back after performing a typed transformation? Is the type being lost along the way, since I get back a collection of Strings instead of Listing objects?
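
For reference, the shape I think a fully typed version would need is to yield a Listing from the comprehension instead of an IndexedSeq[String]. A minimal sketch, assuming a hypothetical processValue function standing in for the per-column processing:

val typedDS = ds.mapPartitions { iter =>
  for (row <- iter) yield {
    // rebuild the case class explicitly, so the yielded element type is
    // Listing, for which Spark can derive an Encoder
    Listing(processValue(row.street), row.zip, row.price)
  }
}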

My other question concerns my confusion about typed and untyped transformations. If a schema is strictly defined for a DataFrame and some transformation is performed on it, why is that still considered an untyped transformation? And if the withColumn method is called on a Dataset (rather than a DataFrame) and the returned value is cast back to a Dataset, is that still considered an untyped transformation?

val newDS = ds.withColumn("zip", some_func).as[Listing]

returns a Dataset.
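
For concreteness, here is how I understand the contrast, as a sketch (some_func here stands in for the real per-value processing; in the map version it is an ordinary Scala function, while in the withColumn version it has to be a Column expression such as a hypothetical some_udf):

import org.apache.spark.sql.functions.col

// typed: the lambda works on Listing objects, so a misspelled field is a
// compile-time error, and the result is already a Dataset[Listing]
val typedZip = ds.map(l => l.copy(zip = some_func(l.zip)))

// untyped: withColumn works against the schema via Column expressions, so a
// misspelled column name only fails at runtime; .as[Listing] casts back
val untypedZip = ds.withColumn("zip", some_udf(col("zip"))).as[Listing]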

EDIT:

I updated the line that returns the row (newRow) as follows:

Listing.getClass.getMethods.find(x => x.getName == "apply" && x.isBridge).get
  .invoke(Listing, newRow map (_.asInstanceOf[AnyRef]): _*).asInstanceOf[Listing]

In the spark-shell this returns a Dataset[Listing] as desired, but when compiling the code with sbt I get the error:

error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.

1 answer:

Answer 0 (score: 0):

I solved this first issue by converting the collection back to the case class (see the EDIT) and making sure to import spark.implicits._.
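
A sketch of what that combination can look like (process and some_condition are placeholders from the question; spark is the SparkSession, whose implicits the shell imports automatically but compiled code must import explicitly):

import spark.implicits._  // needed in compiled code so Encoder[Listing] is in scope

val fixedDS = ds.mapPartitions { iter =>
  iter.map { row =>
    val values = colNames.map { name =>
      val v = row.getClass.getDeclaredMethod(name).invoke(row).toString
      if (some_condition) process(v) else v
    }
    // rebuild the case class, so the lambda returns Iterator[Listing]
    // instead of a collection of Strings
    Listing(values(0), values(1).toInt, values(2).toInt)
  }
}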