Question

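I have a set of timestamped location data, with a set of string feature IDs attached to each location. I'd like to use a Window in Spark to pull together an array of all of these feature ID strings across the previous N and next N rows, something like: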

from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Load the table first so the window spec can refer to its columns
dataFrame = sqlContext.table("locations")

windowSpec = Window \
    .partitionBy(dataFrame['userid']) \
    .orderBy(dataFrame['timestamp']) \
    .rowsBetween(-50, 50)

# featuresCollector is the function I am asking how to write
featureIds = featuresCollector(dataFrame['featureId']).over(windowSpec)
dataFrame.select(
  dataFrame['product'],
  dataFrame['category'],
  dataFrame['revenue'],
  featureIds.alias("allFeatureIds"))

Is this possible with Spark? If so, how would I write a function like featuresCollector that can collect all the feature IDs in the window?

Answer 1

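Spark UDFs cannot be used for aggregations. Spark provides a number of tools (UserDefinedAggregateFunctions, Aggregators, AggregateExpressions) that can be used for custom aggregations, and some of these can be used with windowing, but none of them can be defined in Python.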

If all you want is to collect the records, collect_list should do the trick. Please keep in mind that it is a very expensive operation.

from pyspark.sql.functions import collect_list

featureIds = collect_list('featureId').over(windowSpec)
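
To make the window semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what collect_list over a window partitioned by userid, ordered by timestamp, with rowsBetween(-n, n) should produce. The helper name collect_in_window and the sample rows are made up for illustration; the column names come from the question.

```python
from collections import defaultdict

def collect_in_window(rows, n):
    """Emulate collect_list('featureId') over partitionBy(userid),
    orderBy(timestamp), rowsBetween(-n, n). Hypothetical helper."""
    # Group rows by userid (the partitionBy step).
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["userid"]].append(row)

    result = []
    for part in partitions.values():
        # Sort each partition by timestamp (the orderBy step).
        part.sort(key=lambda r: r["timestamp"])
        for i, row in enumerate(part):
            # Frame: n rows before through n rows after, clipped to the partition.
            frame = part[max(0, i - n): i + n + 1]
            # Like collect_list, skip null (None) values.
            ids = [r["featureId"] for r in frame if r["featureId"] is not None]
            result.append({**row, "allFeatureIds": ids})
    return result

# Made-up sample data
rows = [
    {"userid": "u1", "timestamp": 2, "featureId": "b"},
    {"userid": "u1", "timestamp": 1, "featureId": "a"},
    {"userid": "u1", "timestamp": 3, "featureId": "c"},
    {"userid": "u2", "timestamp": 1, "featureId": "x"},
]
windowed = collect_in_window(rows, n=1)
for r in windowed:
    print(r["userid"], r["timestamp"], r["allFeatureIds"])
```

Each output row carries up to 2n + 1 feature IDs (101 for the -50 to 50 window in the question), which is probably part of why the operation is expensive on large tables.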
