Question

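I'm looking for a way to run spark.ml.feature.PCA over the grouped data returned by a groupBy() call on a DataFrame, but I'm not sure whether this is possible or how to do it. Here is a basic example that hopefully illustrates what I want to do: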

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA   

df = spark.createDataFrame([[3, 1, 1], [4, 2, 1], [5, 2, 1], [3, 3, 2], [6, 2, 2], [4, 4, 2]], ["Value1", "Value2",  "ID"])

df.show()
+------+------+---+
|Value1|Value2| ID|
+------+------+---+
|     3|     1|  1|
|     4|     2|  1|
|     5|     2|  1|
|     3|     3|  2|
|     6|     2|  2|
|     4|     4|  2|
+------+------+---+

assembler = VectorAssembler(inputCols=["Value1", "Value2"], outputCol="features")

df2 = assembler.transform(df)

df2.show()
+------+------+---+---------+
|Value1|Value2| ID| features|
+------+------+---+---------+
|     3|     1|  1|[3.0,1.0]|
|     4|     2|  1|[4.0,2.0]|
|     5|     2|  1|[5.0,2.0]|
|     3|     3|  2|[3.0,3.0]|
|     6|     2|  2|[6.0,2.0]|
|     4|     4|  2|[4.0,4.0]|
+------+------+---+---------+

pca = PCA(k=1, inputCol="features", outputCol="component")

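At this point I have the DataFrame and the pca object I want to use. I would now like to perform PCA on the DataFrame, but grouped by "ID", so that I get the PCA for all the features where ID is 1 and the PCA for all the features where ID is 2, returning just the components. I can get these manually with: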

>>> pca.fit(df2.where("ID==1")).pc
DenseMatrix(2, 1, [-0.8817, -0.4719], 0)
>>> pca.fit(df2.where("ID==2")).pc
DenseMatrix(2, 1, [-0.8817, 0.4719], 0)

But I would like to do this for all the different IDs in the DataFrame in parallel, for example:

df2.groupBy("ID").map(lambda group: pca.fit(group).pc)

But you can't use map() on grouped data like this. Is there a way to achieve this?

Answer 1

Spark >= 3.0.0

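Since Spark 3.0.0 you can use applyInPandas to apply a plain Python function to each group of the current DataFrame and return the results as another DataFrame. You basically need to define the output schema of the returned DataFrame.

Here I'll use scikit-learn's PCA rather than the Spark implementation, because the function has to be applied to individual pandas DataFrames, not Spark DataFrames. Either way, the principal components found should be the same, up to sign. Note that the function below returns the projected value (score) of each row on the component rather than the component vector itself.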

import pandas as pd
from sklearn.decomposition import PCA
from pyspark.sql.types import StructField, StructType, DoubleType


# define PCA parameters
cols = ['Value1', 'Value2']
pca_components = 1


# define Python function
def pca_udf(pdf):
    X = pdf[cols]
    pca = PCA(n_components=pca_components)
    PC = pca.fit_transform(X)
    PC_df = pd.DataFrame(PC, columns=['PC_' + str(i+1) for i in range(pca_components)])
    result = pd.concat([pdf, PC_df], axis=1, ignore_index=True)
    return result


# define output schema; principal components are generated dynamically based on `pca_components`
to_append = [StructField('PC_' + str(i+1), DoubleType(), True) for i in range(pca_components)]
output_schema = StructType(df.schema.fields + to_append)


df\
  .groupby('ID')\
  .applyInPandas(pca_udf, output_schema)\
  .show()

+------+------+---+-------------------+
|Value1|Value2| ID|               PC_1|
+------+------+---+-------------------+
|     3|     1|  1| 1.1962465491226262|
|     4|     2|  1|-0.1572859751773413|
|     5|     2|  1|-1.0389605739452852|
|     3|     3|  2|-1.1755661316905914|
|     6|     2|  2|  1.941315590145264|
|     4|     4|  2|-0.7657494584546719|
+------+------+---+-------------------+
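
The question asked for just the components of each group rather than the per-row scores. A minimal sketch of that variant follows, assuming the same `Value1`, `Value2` and `ID` columns. The names `pca_loadings` and `loadings_by_group` and the `loading_*` column names are made up for illustration, and NumPy's SVD is swapped in for scikit-learn's PCA so the local check needs nothing beyond pandas and NumPy. The sign of each component is arbitrary, so it may be flipped relative to the `pc` matrices printed in the question.

```python
import numpy as np
import pandas as pd

cols = ['Value1', 'Value2']


def pca_loadings(pdf):
    """Return one row per group holding the first principal axis of pdf[cols]."""
    X = pdf[cols].to_numpy(dtype=float)
    Xc = X - X.mean(axis=0)  # centre the features before decomposing
    # rows of Vt are the principal axes, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    row = {'ID': [pdf['ID'].iloc[0]]}
    for c, v in zip(cols, Vt[0]):
        row['loading_' + c] = [float(v)]  # hypothetical output column names
    return pd.DataFrame(row)


def loadings_by_group(sdf):
    # sdf is assumed to be a Spark DataFrame with columns Value1, Value2, ID
    return sdf.groupby('ID').applyInPandas(
        pca_loadings,
        schema='ID long, loading_Value1 double, loading_Value2 double')


# Local check with plain pandas, using the data from the question
pdf = pd.DataFrame(
    [[3, 1, 1], [4, 2, 1], [5, 2, 1], [3, 3, 2], [6, 2, 2], [4, 4, 2]],
    columns=['Value1', 'Value2', 'ID'])
local = pd.concat([pca_loadings(g) for _, g in pdf.groupby('ID')],
                  ignore_index=True)
print(local.round(4))
```

On a cluster, `loadings_by_group(df)` would then give one row per ID. In absolute value the local result agrees with the 0.8817 and 0.4719 entries reported by Spark's `pca.fit(...).pc` in the question.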

Spark < 3.0.0

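Before Spark 3.0.0 (but still with Spark >= 2.3.0) the solution is similar, but we need to actually define a pandas_udf: a vectorized user-defined function executed by Spark, which uses Arrow to transfer the data and pandas to work with it. The idea behind defining it is the same as before.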

import pandas as pd
from sklearn.decomposition import PCA
from pyspark.sql.types import StructField, StructType, DoubleType
from pyspark.sql.functions import pandas_udf, PandasUDFType


# wrapper function that defines the pandas_udf and lets us pass parameters to it
def pca_by_group(df, cols, pca_components=1):
    # build output schema for the Pandas UDF
    # principal components are generated dynamically based on `pca_components`
    to_append = [StructField('PC_' + str(i+1), DoubleType(), True) for i in range(pca_components)]
    output_schema = StructType(df.schema.fields + to_append)

    # Pandas UDF for applying PCA within each group
    @pandas_udf(output_schema, functionType=PandasUDFType.GROUPED_MAP)
    def pca_udf(pdf):
        X = pdf[cols]
        pca = PCA(n_components=pca_components)
        PC = pca.fit_transform(X)
        PC_df = pd.DataFrame(PC, columns=['PC_' + str(i+1) for i in range(pca_components)])
        result = pd.concat([pdf, PC_df], axis=1, ignore_index=True)
        return result
    
    # apply the Pandas UDF
    df = df\
        .groupby('ID')\
        .apply(pca_udf)
    
    return df


new_df = pca_by_group(df, cols=['Value1', 'Value2'], pca_components=1)
new_df.show()

+------+------+---+-------------------+
|Value1|Value2| ID|               PC_1|
+------+------+---+-------------------+
|     3|     1|  1| 1.1962465491226262|
|     4|     2|  1|-0.1572859751773413|
|     5|     2|  1|-1.0389605739452852|
|     3|     3|  2|-1.1755661316905914|
|     6|     2|  2|  1.941315590145264|
|     4|     4|  2|-0.7657494584546719|
+------+------+---+-------------------+
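
One detail of `pca_udf` that is easy to miss in both versions: `pd.concat(..., axis=1, ignore_index=True)` throws away the column names and relabels the columns 0, 1, 2, ... As far as I understand the PySpark documentation for grouped-map functions, a returned pandas DataFrame with non-string column labels is matched to the output schema by position, whereas string labels are matched by name. The sketch below shows only the pandas side of this. The `PC_1` values are the scores for ID 1 rounded from the output above, used purely as placeholders.

```python
import pandas as pd

pdf = pd.DataFrame({'Value1': [3, 4, 5], 'Value2': [1, 2, 2], 'ID': [1, 1, 1]})
# placeholder scores, rounded from the table above
PC_df = pd.DataFrame({'PC_1': [1.196, -0.157, -1.039]})

# what pca_udf does: column labels become integers
by_position = pd.concat([pdf, PC_df], axis=1, ignore_index=True)

# alternative: keep the names, so the labels are strings
by_name = pd.concat([pdf, PC_df], axis=1)

print(list(by_position.columns))  # [0, 1, 2, 3]
print(list(by_name.columns))      # ['Value1', 'Value2', 'ID', 'PC_1']
```

Either form should work with the schema built above, as long as the column order (positional case) or the column names (named case) line up with `output_schema`.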
