Filtering a DataFrame with pandas_udf in PySpark

Time: 2019-05-09 10:37:05

Tags: python pandas apache-spark pyspark user-defined-functions

I have a Spark DataFrame with the following schema:

root
 |-- idvalue: string (nullable = true)
 |-- locationaccuracyhorizontal: float (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- is_weekend: boolean (nullable = true)
 |-- locationlatrad: float (nullable = true)
 |-- locationlonrad: float (nullable = true)
 |-- epochtimestamp: integer (nullable = true)
 |-- velocity: float (nullable = true)

Sample data for the relevant columns looks like this:

+--------------------+--------------------------+----+---+--------------+-----------+
|             idvalue|locationaccuracyhorizontal|hour|day|epochtimestamp|   velocity|
+--------------------+--------------------------+----+---+--------------+-----------+
|000xxxxx-yyyy-zzz...|                      32.0|  23|  9|    1554853730|       null|
|000xxxxx-yyyy-zzz...|                     165.0|   0| 10|    1554854501|   0.635121|
|000xxxxx-yyyy-zzz...|                      65.0|   0| 10|    1554854814| 0.96369237|
|000xxxxx-yyyy-zzz...|                     165.0|   0| 10|    1554855465|  0.3710725|
|000xxxxx-yyyy-zzz...|                    2000.0|   0| 10|    1554857260|   2.383398|
|000xxxxx-yyyy-zzz...|                    3000.0|   0| 10|    1554857625|  26.000359|
|000xxxxx-yyyy-zzz...|                      96.0|   0| 10|    1554857919|  30.961931|
|                                        ...                                        |
|000xxxxx-yyyy-zzz...|                      32.0|  10| 11|    1554977822|   55.37194|
+--------------------+--------------------------+----+---+--------------+-----------+

I want to group by idvalue and then filter each group with a pandas_udf. I tried the following:

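from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(df1.schema, PandasUDFType.GROUPED_MAP)
def filter_data(pdf):
    # pandas query expressions use lowercase `and`
    return pdf.query('hour > @MIN_NIGHT_HOUR and hour < @MAX_NIGHT_HOUR')

# show() returns None, so keep the DataFrame and display it separately
df2 = df1.groupBy('idvalue') \
        .apply(filter_data)
df2.show()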

But it fails with the following error:

An error occurred while calling o2397.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 74.0 failed 4 times, most recent failure: Lost task 0.3 in stage 74.0 (TID 421, ip-10-0-3-239.eu-west-1.compute.internal, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/serializers.py", line 284, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/serializers.py", line 253, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/serializers.py", line 253, in <listcomp>
    arrs = [create_array(s, t) for s, t in series]
  File "/mnt1/yarn/usercache/livy/appcache/application_1555045880196_0210/container_1555045880196_0210_01_000048/pyspark.zip/pyspark/serializers.py", line 251, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "pyarrow/array.pxi", line 542, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value out of bounds

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:156)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
    at sun.reflect.GeneratedMethodAccessor338.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

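The pandas_udf documentation states:

A grouped map UDF defines the transformation: A pandas.DataFrame -> A pandas.DataFrame

The output of pdf.query is also a DataFrame, so I am confused about the cause of this error. I have also tried other filter queries, to no avail.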

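1 Answer:

Answer 0 (score: 0)

I can reproduce your problem: epochtimestamp is declared as an integer (32-bit), but the data contains values that only fit in a long (64-bit).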
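To see why the declared type matters, here is a minimal sketch (pandas and NumPy only, no Spark) that checks which of the epochtimestamp values used in this answer's reproduction fall outside the 32-bit signed range that IntegerType can hold. The variable names are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# epochtimestamp values used in the reproduction in this answer
epochs = pd.Series([
    155485373044444, 1554854501, 1554854814, 1554855465,
    1554857260, 1554857625, 155485791922,
])

# Largest value a 32-bit signed integer can hold
INT32_MAX = np.iinfo(np.int32).max  # 2147483647

# Values that cannot be stored in a 32-bit integer column
overflowing = epochs[epochs > INT32_MAX].tolist()
print(overflowing)  # [155485373044444, 155485791922]
```

Any group whose filtered result still contains one of these values cannot be converted to an integer column, which matches the "Integer value out of bounds" message raised by pyarrow.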

If you change the epochtimestamp data type to Long, it works.

The following code produces the same error:

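from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, IntegerType)

# Return schema of the UDF: epochtimestamp is declared as a 32-bit integer
schema = StructType(
    [StructField("idvalue", StringType(), True),
     StructField("hour", LongType(), True),
     StructField("epochtimestamp", IntegerType(), True)]
)

# `spark` is an existing SparkSession. Only column names are passed here,
# so Spark infers epochtimestamp as long from the data.
df1 = spark.createDataFrame(
    [('000xxxxx-yyyy-zzz', 23, 155485373044444),
     ('000xxxxx-yyyy-zzz', 0, 1554854501),
     ('000xxxxx-yyyy-zzz', 0, 1554854814),
     ('000xxxxx-yyyy-zzz', 0, 1554855465),
     ('000xxxxx-yyyy-zzz', 0, 1554857260),
     ('000xxxxx-yyyy-zzz2', 0, 1554857625),
     ('000xxxxx-yyyy-zzz1', 0, 155485791922)],
    ['idvalue', 'hour', 'epochtimestamp']
)

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def filter_data(pdf):
    MIN_NIGHT_HOUR = 0
    MAX_NIGHT_HOUR = 24
    # Only the hour-23 row passes this filter, and its epochtimestamp
    # does not fit into the IntegerType declared in the return schema
    return pdf.query('hour > @MIN_NIGHT_HOUR & hour < @MAX_NIGHT_HOUR')

df1.groupBy(
    'idvalue'
).apply(
    filter_data
).show()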
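For comparison, here is a sketch of the same filter with epochtimestamp declared as LongType in the return schema, which is the change suggested above. The helper names filter_night_hours and filter_with_long_schema are hypothetical, and the Spark part assumes a DataFrame with the three columns from the reproduction. The pandas logic is kept in a plain function so it can be checked without a cluster.

```python
import pandas as pd


def filter_night_hours(pdf, min_hour=0, max_hour=24):
    """Keep rows whose hour lies strictly between min_hour and max_hour."""
    return pdf.query('hour > @min_hour and hour < @max_hour')


def filter_with_long_schema(df):
    """Apply the filter per idvalue, returning epochtimestamp as a long.

    `df` is assumed to be a Spark DataFrame with the columns
    idvalue, hour and epochtimestamp.
    """
    # Imported here so the pandas function above works without Spark installed
    from pyspark.sql import functions as F
    from pyspark.sql.types import (LongType, StringType,
                                   StructField, StructType)

    out_schema = StructType([
        StructField("idvalue", StringType(), True),
        StructField("hour", LongType(), True),
        StructField("epochtimestamp", LongType(), True),  # was IntegerType
    ])
    grouped_filter = F.pandas_udf(
        out_schema, F.PandasUDFType.GROUPED_MAP)(filter_night_hours)
    return df.groupBy("idvalue").apply(grouped_filter)
```

With the df1 from the reproduction, filter_with_long_schema(df1).show() would be the equivalent call. The return schema has to describe the values the function actually returns, not just the column names.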