I'm using pySpark 2.1 on Databricks.
I've written a UDF to generate a unique uuid for each row of a pyspark dataframe. The dataframes I'm working with are relatively small (< 10,000 rows) and should never grow beyond that.
I know there are built-in Spark functions zipWithIndex()
and zipWithUniqueId()
for generating row indexes, but I've been specifically asked to use uuids for this particular project.
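For contrast, a minimal sketch of the RDD-based approach I was asked not to use might look like the following (df here is just a placeholder for any dataframe):
# hypothetical sketch: sequential row indexes via zipWithIndex instead of uuids
indexed = (
    df.rdd
      .zipWithIndex()                             # pairs each Row with a sequential index
      .map(lambda pair: pair[0] + (pair[1],))     # append the index to the row tuple
      .toDF(df.columns + ["row_index"])
)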
The UDF udf_insert_uuid
works fine on a small dataset, but it seems to conflict with the built-in Spark function subtract.
What is causing this error:
package.TreeNodeException: Binding attribute, tree: pythonUDF0#104830
Deeper in the driver stack trace it also says:
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#104830
Here is the code I'm running:
import pandas
from pyspark.sql.functions import *
from pyspark.sql.types import *
import uuid

# define a python function
def insert_uuid():
    user_created_uuid = str( uuid.uuid1() )
    return user_created_uuid

# register the python function for use in dataframes
udf_insert_uuid = udf(insert_uuid, StringType())
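As a quick sanity check (my own test, not part of the original notebook), the zero-argument UDF does work on its own:
spark.range(3).withColumn("uuid", udf_insert_uuid()).show(truncate=False)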
list_of_numbers = range(1000,1050)
temp_pandasDF = pandas.DataFrame(list_of_numbers, index=None)

sparkDF = (
    spark
    .createDataFrame(temp_pandasDF, ["data_points"])
    .withColumn("labels", when( col("data_points") < 1025, "a" ).otherwise("b"))  # if "data_points" < 1025 then "labels" = "a", else "labels" = "b"
    .repartition("labels")
)
sparkDF.createOrReplaceTempView("temp_spark_table")
# add a unique id for each row
# the udf works fine in this line of code
sparkDF = sparkDF.withColumn("id", lit( udf_insert_uuid() ))
sparkDF.show(20, False)
+-----------+------+------------------------------------+
|data_points|labels|id |
+-----------+------+------------------------------------+
|1029 |b |d3bb91e0-9cc8-11e7-9b70-00163e9986ba|
|1030 |b |d3bb95e6-9cc8-11e7-9b70-00163e9986ba|
|1035 |b |d3bb982a-9cc8-11e7-9b70-00163e9986ba|
|1036 |b |d3bb9a50-9cc8-11e7-9b70-00163e9986ba|
|1042 |b |d3bb9c6c-9cc8-11e7-9b70-00163e9986ba|
+-----------+------+------------------------------------+
only showing top 5 rows
list_of_numbers = range(1025,1075)
temp_pandasDF = pandas.DataFrame(list_of_numbers, index=None)

new_DF = (
    spark
    .createDataFrame(temp_pandasDF, ["data_points"])
    .withColumn("labels", when( col("data_points") < 1025, "a" ).otherwise("b"))  # if "data_points" < 1025 then "labels" = "a", else "labels" = "b"
    .repartition("labels")
)

new_DF.show(5, False)
+-----------+------+
|data_points|labels|
+-----------+------+
|1029 |b |
|1030 |b |
|1035 |b |
|1036 |b |
|1042 |b |
+-----------+------+
only showing top 5 rows
values_not_in_new_DF = new_DF.subtract(sparkDF.drop("id"))

display(
    values_not_in_new_DF
    .withColumn("id", lit( udf_insert_uuid() ))  # add a column of unique uuid's
)
package.TreeNodeException: Binding attribute, tree: pythonUDF0#104830
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#104830
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:473)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:472)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:472)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:610)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:313)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:354)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2807)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
  at org.apache.spark.sql.Dataset$$anonfun$60.apply(Dataset.scala:2791)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:87)
  at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:53)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2790)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2132)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2345)
  at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:81)
  at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
  at com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$getResultBuffer$1.apply(PythonDriverLocal.scala:461)
  at com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$getResultBuffer$1.apply(PythonDriverLocal.scala:441)
  at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:394)
  at com.databricks.backend.daemon.driver.PythonDriverLocal.getResultBuffer(PythonDriverLocal.scala:441)
  at com.databricks.backend.daemon.driver.PythonDriverLocal.com$databricks$backend$daemon$driver$PythonDriverLocal$$outputSuccess(PythonDriverLocal.scala:428)
  at com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$repl$3.apply(PythonDriverLocal.scala:178)
  at com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$repl$3.apply(PythonDriverLocal.scala:175)
  at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:394)
  at com.databricks.backend.daemon.driver.PythonDriverLocal.repl(PythonDriverLocal.scala:175)
  at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$2.apply(DriverLocal.scala:230)
  at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$2.apply(DriverLocal.scala:211)
  at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:173)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
  at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:168)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:39)
  at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:206)
  at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:39)
  at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:211)
  at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
  at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589)
  at scala.util.Try$.apply(Try.scala:161)
  at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584)
  at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:488)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391)
  at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:348)
  at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#104830 in [data_points#104799L, labels#104802]
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
  ... 82 more
Answer 0: (score: 2)
I ran into the same error when running your script. The only way I found to make it work was to pass the UDF a column instead of calling it with no arguments:
def insert_uuid(col):
    user_created_uuid = str( uuid.uuid1() )
    return user_created_uuid

udf_insert_uuid = udf(insert_uuid, StringType())
Then call it on labels, for example:
values_not_in_new_DF\
    .withColumn("id", udf_insert_uuid("labels"))\
    .show()
There is no need to use lit.
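Putting it together, the tail end of your script would then look roughly like this (my own sketch of the fix applied to your code, using the column-taking UDF defined above):
values_not_in_new_DF = new_DF.subtract(sparkDF.drop("id"))

display(
    values_not_in_new_DF
    .withColumn("id", udf_insert_uuid("labels"))  # pass any existing column; its value is ignored
)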