Question

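I need to filter the values in a Spark DataFrame column by data type. I want the column to contain only floats. I tried using a regular expression, but writing to a CSV file fails with: SparkException: Task not serializable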

Here is the method that reads a CSV file into a DataFrame. I then filter some of the columns and write the result back to a CSV file:

def processDatasetCsvWithSpark(sqlContext: SQLContext, columnNames: Seq[String], filename: String, dfSchema: StructType,
                                 inputFilepath: String, outputFilepath: String) = {
    val testDf = sparkNeo4jWriteBenchmarks.readFromCsvToDfWithCustomSchema(sqlContext, filename, inputFilepath, dfSchema)

    val renamedColsDf = testDf.toDF(columnNames: _*)

    val filteredBioDF = renamedColsDf.withColumn("bio", regexp_replace(renamedColsDf("bio"), forbiddenSymbols, "")).dropDuplicates()

    val filteredFloatDF: DataFrame = filteredBioDF.filter( df => numberRegex.pattern.matcher(filteredBioDF.select("lat:FLOAT").toString()).matches)
    filteredFloatDF
      .write
      .format("csv")
      .option("header", "true")
      .save(outputFilepath + filename + ".csv")
  }

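Without the filteredFloatDF computation, the DataFrame is written without any problem. So how can I filter my DataFrame column by the Float data type, or apply the number regex correctly, without getting this error?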

Answer 1

One approach is to use cast() to cast the column to FloatType, which essentially converts all non-float values to null:

// CSV file content:
// id,value
// 1,50
// 2,null
// 3,60.5
// 4,a

val df = spark.read.
  option("header", true).
  csv("/path/to/csvfile")

import org.apache.spark.sql.types._
import spark.implicits._  // provides the $"..." column syntax (already imported in spark-shell)

val df2 = df.withColumn("val_float", $"value".cast(FloatType))
// +---+-----+---------+
// | id|value|val_float|
// +---+-----+---------+
// |  1|   50|     50.0|
// |  2| null|     null|
// |  3| 60.5|     60.5|
// |  4|    a|     null|
// +---+-----+---------+
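
To actually drop the non-float rows, the null results of the cast can be filtered out. The following is a minimal sketch rather than a definitive implementation. It assumes a local SparkSession, builds the sample data in memory instead of reading the CSV file, and turns off ANSI mode explicitly, because with `spark.sql.ansi.enabled` set to true an invalid cast raises an error instead of producing null. The name `floatsOnly` is only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.FloatType

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("float-filter-sketch")
  .config("spark.sql.ansi.enabled", "false") // invalid casts yield null instead of failing
  .getOrCreate()
import spark.implicits._

// Same rows as the sample CSV, built in memory for illustration.
val df = Seq[(String, String)](
  ("1", "50"),
  ("2", null),
  ("3", "60.5"),
  ("4", "a")
).toDF("id", "value")

// Cast to float, then keep only the rows where the cast succeeded.
val floatsOnly = df
  .withColumn("val_float", col("value").cast(FloatType))
  .filter(col("val_float").isNotNull)

floatsOnly.show()
```

With the sample data, only the rows with id 1 and 3 should remain. Because the filter is a Column expression rather than a Scala closure, nothing from the enclosing class has to be serialized and shipped to the executors.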

If necessary, you can cast the FloatType column back to StringType.
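
As a sketch of that last step, and of a regex-based alternative closer to what the question attempted, the snippet below casts the float column back to a string and also filters with `Column.rlike`. It assumes a local SparkSession with ANSI mode off. The pattern `floatPattern` is a hypothetical stand-in for the asker's `numberRegex`, which is not shown in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{FloatType, StringType}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("float-roundtrip-sketch")
  .config("spark.sql.ansi.enabled", "false") // invalid casts yield null instead of failing
  .getOrCreate()
import spark.implicits._

val df = Seq[(String, String)](
  ("1", "50"), ("2", null), ("3", "60.5"), ("4", "a")
).toDF("id", "value")

// Option 1: cast to float, drop failed casts, then cast back to string.
// Note that the string form is normalized by the round trip (e.g. "50" becomes "50.0").
val roundTripped = df
  .withColumn("value", col("value").cast(FloatType))
  .filter(col("value").isNotNull)
  .withColumn("value", col("value").cast(StringType))

// Option 2: keep the original strings and filter with a regex on the column.
// Hypothetical pattern: optional sign, optional integer part, optional dot, digits.
val floatPattern = "^[-+]?[0-9]*\\.?[0-9]+$"
val regexFiltered = df.filter(col("value").rlike(floatPattern))

roundTripped.show()
regexFiltered.show()
```

Option 2 leaves the values exactly as they were read, while Option 1 accepts whatever Spark's cast accepts, which may be broader than a hand-written pattern.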

Filtering a DataFrame by Float column values in Scala
