Applying a map function to all elements of a column in a Spark DataFrame

Date: 2016-08-04 21:31:43

Tags: scala apache-spark spark-dataframe

I am trying to apply a function to all elements of a column in a Spark DataFrame in Scala. The input is a string that looks like {"count": 10}, and I want to return only the Int part - in this example, 10. I can do this on a toy example:

val x = List("{\"count\": 107}", "{\"count\": 9}", "{\"count\": 456}")     
val _list = x.map(x => x.substring(10,x.length-1).toInt)

However, when I try to apply a UDF to my DataFrame, I get an error:

val getCounts: String => Int = _.substring(10,x.length-1).toInt  // note: x.length refers to the outer list x from the toy example
import org.apache.spark.sql.functions.udf
val myUDF = udf(getCounts)

df.withColumn("post_shares_int", myUDF('post_shares)).show

Error output:

    org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2060)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
    at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
....

Any help on how to do this would be greatly appreciated.

1 Answer:

Answer 0 (score: 1)

Forget the custom UDF; there is already a function available for this task, regexp_extract, which is documented here:

import org.apache.spark.sql.functions.regexp_extract

df.withColumn(
  "post_shares_int",
  // capture the digits from strings like {"count": 107}
  regexp_extract(df("post_shares"), "^\\{\"\\w+\": (\\d+)\\}$", 1)
).show
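
Note that regexp_extract returns the captured group as a string. If an actual integer column is needed, as the question asks, the extracted value can be cast (a minimal sketch reusing the same pattern; cast is a standard Column method):

df.withColumn(
  "post_shares_int",
  regexp_extract(df("post_shares"), "^\\{\"\\w+\": (\\d+)\\}$", 1).cast("int")
).show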


Update: as per the comments below, it is better to use get_json_object to parse the JSON string:

import org.apache.spark.sql.functions.get_json_object

df.withColumn(
  "post_shares_int",
  get_json_object(df("post_shares"), "$.count")
).show
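
get_json_object also returns a string, so a cast may still be wanted. For reference, a self-contained sketch built from the toy strings in the question (assuming the sqlContext available in a Spark 1.x shell, which matches the stack trace):

import sqlContext.implicits._
import org.apache.spark.sql.functions.get_json_object

// wrap each string in Tuple1 so the local Seq can be converted to a one-column DataFrame
val df = Seq("{\"count\": 107}", "{\"count\": 9}", "{\"count\": 456}")
  .map(Tuple1.apply)
  .toDF("post_shares")

df.withColumn(
  "post_shares_int",
  get_json_object(df("post_shares"), "$.count").cast("int")
).show
// post_shares_int now holds 107, 9 and 456 as integers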