Question

I have a UDF that finds the n most frequent elements in a list. The list is populated by a window function over a range of rows preceding the current row:

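from collections import Counter
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

@F.udf(returnType=ArrayType(IntegerType()))
def most_common(x):
    """Find the 2 most common elements."""
    return [y[0] for y in Counter(x).most_common(n=2)]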

# Example dataset:
df = sqlContext.createDataFrame(
    [(1, 2, 9), (2, 3, 1), (3, 4, 1), (4, 5, 2),
     (1, 5, 6), (2, 3, 2), (5, 89, 12), (2, 6, 85),
     (1, 5, 6), (2, 12, 2), (5, 9, 12), (2, 65, 85),
     (1, 2, 9), (2, 3, 1), (3, 4, 1), (4, 5, 2),
     (1, 3, 53), (2, 13, 1), (3, 40, 1), (3, 5, 1)],
    ("id", "timestamp", "value"))

from pyspark.sql.window import Window
# example window specification:
w = Window.partitionBy("id")\
          .orderBy(F.col("timestamp"))\
          .rangeBetween(start=-10, end=Window.currentRow)

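# all_hours is the raw windowed list; top2_hours is the UDF applied to it.
df_agg = df.select(
    "id",
    "timestamp",
    F.collect_list(F.col("timestamp")).over(w).alias("all_hours"),
    most_common(F.collect_list(F.col("timestamp")).over(w)).alias("top2_hours"),
)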

This UDF runs fine on a local Spark instance in Spyder. See the output below:

    +---+---------+--------------------+----------+
    | id|timestamp|           all_hours|top2_hours|
    +---+---------+--------------------+----------+
    |  5|        9|                 [9]|       [9]|
    |  5|       89|                [89]|      [89]|
    |  1|        2|              [2, 2]|       [2]|
    |  1|        2|              [2, 2]|       [2]|
    |  1|        3|           [2, 2, 3]|    [2, 3]|
    |  1|        5|     [2, 2, 3, 5, 5]|    [2, 5]|
    |  1|        5|     [2, 2, 3, 5, 5]|    [2, 5]|
    |  3|        4|              [4, 4]|       [4]|
    |  3|        4|              [4, 4]|       [4]|
    |  3|        5|           [4, 4, 5]|    [4, 5]|
    |  3|       40|                [40]|      [40]|
    |  2|        3|           [3, 3, 3]|       [3]|
    |  2|        3|           [3, 3, 3]|       [3]|
    |  2|        3|           [3, 3, 3]|       [3]|
    |  2|        6|        [3, 3, 3, 6]|    [3, 6]|
    |  2|       12|    [3, 3, 3, 6, 12]|    [3, 6]|
    |  2|       13|[3, 3, 3, 6, 12, 13]|    [3, 6]|
    |  2|       65|                [65]|      [65]|
    |  4|        5|              [5, 5]|       [5]|
    |  4|        5|              [5, 5]|       [5]|
    +---+---------+--------------------+----------+
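
To make the window semantics in that output concrete, here is a minimal plain-Python sketch of what the range frame and the UDF compute together for a single id. The helper name `top2_in_range_window` and its `lookback` parameter are hypothetical and only for illustration; the sketch assumes the frame covers every row whose timestamp lies in [t - 10, t], which is how `rangeBetween(-10, Window.currentRow)` treats the ordering column.

```python
from collections import Counter


def top2_in_range_window(timestamps, lookback=10):
    """Emulate the range frame for one partition (hypothetical helper).

    For each timestamp t, gather every timestamp u with t - lookback <= u <= t,
    then keep the 2 most common values, as the UDF does.
    """
    ordered = sorted(timestamps)
    result = []
    for t in ordered:
        # Rows with the same timestamp are peers, so they share one frame.
        frame = [u for u in ordered if t - lookback <= u <= t]
        top2 = [value for value, _ in Counter(frame).most_common(2)]
        result.append((t, frame, top2))
    return result


# Timestamps of the rows with id = 2 in the example dataset.
for row in top2_in_range_window([3, 3, 6, 12, 65, 3, 13]):
    print(row)
```

Because the frame is defined on timestamp values rather than row counts, rows that share a timestamp see the same list, which is why the duplicated rows in the table carry identical `all_hours`.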

However, I cannot use it on the enterprise cluster through the Dataiku interface. It returns the following error:

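Py4JJavaError: An error occurred while calling o931.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 20.0 failed 4 times, most recent failure: Lost task 4.3 in stage 20.0 (TID 62874, enchbcclprcp117.srv.bmogc.net, executor 1): java.io.IOException: Cannot run program "": error=2, No such file or directory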

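If possible, I would prefer not to use a UDF for this. Is there a way to find the n most frequent elements in a list natively, without a UDF (the list being populated over a defined time window partitioned by id, as in the code above)?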

PySpark: find the N most frequent elements in a list without a UDF

0 answers: