Scala: apply a function to each row of a dataframe

Asked: 2018-07-25 02:40:42

Tags: scala apache-spark flatmap

I have a dataframe in Scala where I need to apply a function to each row.


I need to write a function named postToDB, in which I post each record to the database, and finally return the failed records as a dataframe of rows.


How can I apply the postToDB function to each row and return only the failed rows as a dataframe?

2 Answers:

Answer 0 (score: 0)

You can use an `Option` return type: return `None` for rows that were posted successfully and `Some(row)` for rows that failed, so that `flatMap` keeps only the failed rows.

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df1: DataFrame = ??? // the initial dataframe which has rows in it

// spark.implicits._ does not provide an Encoder[Row], so build one from the schema
implicit val rowEncoder = RowEncoder(df1.schema)

// flatMap drops None and unwraps Some, so df2 contains only the failed rows
val df2: Dataset[Row] = df1.flatMap(row => postToDB(row))

def postToDB(row: Row): Option[Row] = {
  try {
    // try inserting into the db; the insert succeeded if no exception is thrown
    None                 // success: contribute no row to the result
  } catch {
    case ex: Exception =>
      ex.printStackTrace()
      Some(row)          // failure: return the failed row
  }
}
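The same pattern can be demonstrated without Spark. Below is a minimal, runnable sketch in plain Scala, where `saveRecord` is a hypothetical stand-in for a real database insert: `flatMap` over a function returning `Option` keeps exactly the elements for which the function returned `Some`.

```scala
// Hypothetical stand-in for a database insert: succeeds for odd numbers,
// "fails" (throws) for even ones. None = success, Some(x) = failed element.
def saveRecord(x: Int): Option[Int] =
  try {
    if (x % 2 == 0) throw new RuntimeException(s"insert failed for $x")
    None                // success: contribute nothing to the result
  } catch {
    case _: Exception => Some(x)  // failure: keep the element
  }

val records = List(1, 2, 3, 4, 5)
val failed  = records.flatMap(saveRecord)  // only the even (failed) elements remain
```

With a Spark `Dataset` the mechanics are the same, except that an `Encoder` for the result type must be in scope.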

Answer 1 (score: 0)

The issue is with the return type of postToDB. You return a dataframe which is an illegal type in this context.

A method provided to flatMap should return a Traversable, as the interface for flatMap is:

flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U] 

Dataframe does not implement TraversableOnce.

Instead, simply return a list, either empty or containing the single row, as follows:

def postToDB(row: Row): Traversable[Row] = {
  try {
    // try inserting into the db; the insert succeeded if no exception is thrown
    List()          // success: no row to report
  } catch {
    case ex: Exception =>
      ex.printStackTrace()
      List(row)     // failure: return the failed row
  }
}
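The list-returning variant can likewise be sketched in plain, runnable Scala (no Spark needed to see the behavior; `insert` below is a hypothetical stand-in for the database call): an empty list on success, a single-element list on failure, and `flatMap` concatenates them so only failures survive.

```scala
// Hypothetical stand-in for a database insert: an empty record is
// treated as a failing insert.
def insert(s: String): List[String] =
  try {
    require(s.nonEmpty, "empty record")  // simulate a failing insert
    List()          // success: nothing to report
  } catch {
    case _: Exception => List(s)         // failure: return the record itself
  }

val rows       = List("a", "", "c")
val failedRows = rows.flatMap(insert)    // only the failing records remain
```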

Note: You cannot use DataFrame, Dataset, SparkContext, or any other construct tied to Spark's distributed machinery from within the method passed to flatMap (or any method serialized to the workers, including UDFs and methods passed to map). Spark can only ship to the workers methods that operate on standard Scala objects or on non-distributed Spark objects such as Row.