PySpark: the reverse of VectorAssembler

Asked: 2018-05-22 10:52:44

Tags: python apache-spark pyspark

Note: this looks like a long post, but almost half of its length is due to the output I show. It really isn't that long.

I have a dataset on which I am performing PCA. To do this, I first combine the required features into a single vector with VectorAssembler, pass that to StandardScaler, and then pass the result to PCA. Let's say the number of components I extract (k) is 5. The output of PCA will then have an additional column holding a feature vector with k=5 elements. I want to turn this output feature vector back into separate columns; basically, I want to do the reverse of what VectorAssembler does. Below is the code I use to perform the center-scaling and to fit the PCA. You can skip to the end of the code block and look at the output of train_w_pca.show(4).

# generate some random data
# (assumes an active SparkSession named spark, as in the pyspark shell)
import random
n_cols = 5
n_samples = 1000
# each cell is a random integer drawn from a randomly chosen range
train = [[random.randint(random.randint(1, 10), random.randint(11, 20)) for j in range(n_cols)] for i in range(n_samples)]
col_names = ['col_' + str(i) for i in range(1, 1 + n_cols)]
train = spark.createDataFrame(train, col_names)
train.show(4)

+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|
+-----+-----+-----+-----+-----+
|   13|   15|    6|   13|    2|
|   11|   10|   10|   18|   12|
|   13|   11|   15|   16|   12|
|    6|    2|    9|    1|   14|
+-----+-----+-----+-----+-----+

from pyspark.ml.feature import PCA, StandardScaler, VectorAssembler
ss_inpt_features = 'ss_inpt_features'
pca_inpt_features = 'pca_inpt_features'
pca_otpt_features = 'pca_otpt_features'

# create feature vector
assembler = VectorAssembler(inputCols=col_names, outputCol=ss_inpt_features)
train_fv = assembler.transform(train)

# scale features
scaler = StandardScaler(inputCol=ss_inpt_features, outputCol=pca_inpt_features, withStd=True, withMean=True)
scaler_model = scaler.fit(train_fv)
train_w_scaling = scaler_model.transform(train_fv)

# fit pca
pca = PCA(k=5, inputCol=pca_inpt_features, outputCol=pca_otpt_features)
pca_model = pca.fit(train_w_scaling)
train_w_pca = pca_model.transform(train_w_scaling)

# remove columns created by vector assembler and standard scaler
drop_cols = [ss_inpt_features, pca_inpt_features]
train_w_pca = train_w_pca.drop(*drop_cols)
train_w_pca.show(4)

+-----+-----+-----+-----+-----+--------------------+
|col_1|col_2|col_3|col_4|col_5|   pca_otpt_features|
+-----+-----+-----+-----+-----+--------------------+
|   13|   15|    6|   13|    2|[-1.4493719150189...|
|   11|   10|   10|   18|   12|[-1.1865568406658...|
|   13|   11|   15|   16|   12|[-0.7908207273087...|
|    6|    2|    9|    1|   14|[3.09817312370792...|
+-----+-----+-----+-----+-----+--------------------+

The following code block is where I am looking for suggestions. The end result I want is my original input features plus k=5 additional columns coming from the PCA; I need the PCA output as separate columns rather than as a single vector. To convert pca_otpt_features into columns, I do the following:

# pull out the PCA output vector and unpack each DenseVector into a row of doubles
pca_df = train_w_pca.select(pca_otpt_features).rdd.map(lambda x: x[0].toArray().tolist()).toDF()

+-------------------+-------------------+-------------------+-------------------+-------------------+
|                 _1|                 _2|                 _3|                 _4|                 _5|
+-------------------+-------------------+-------------------+-------------------+-------------------+
|-1.4493719150189412|-2.0931036194154142|0.13464273679638206|-0.8503916932872954| -0.827046144216132|
|-1.1865568406658527| 0.9611918240919298| 0.4488590799950029| 0.2034917117540555|-1.1518683939179468|
|-0.7908207273087919|   1.41447404343126|  0.472864927075101|-0.9315863936330135|-0.3779343594619816|
|  3.098173123707924|-0.1132068349678993| -0.762727469457229| 1.6553163148968495| 0.4664493257688783|
+-------------------+-------------------+-------------------+-------------------+-------------------+
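
(As an aside, the autogenerated _1 ... _5 names could be replaced in a single call; the pca_1 ... pca_5 names below are just illustrative:)

# rename the autogenerated columns in place (illustrative names)
pca_df = pca_df.toDF(*['pca_' + str(i) for i in range(1, 6)])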

# drop the pca columns
train_w_pca = train_w_pca.drop(pca_otpt_features)
train_w_pca.show(4)
+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|
+-----+-----+-----+-----+-----+
|   13|   15|    6|   13|    2|
|   11|   10|   10|   18|   12|
|   13|   11|   15|   16|   12|
|    6|    2|    9|    1|   14|
+-----+-----+-----+-----+-----+

# add an index to both dataframes so they can be joined row by row
from pyspark.sql.types import LongType
def zipindexdf(df):
    """
    :param df: spark dataframe to which an index column is to be added
    :return: same dataframe but with an additional index column
    """
    # zipWithIndex assigns consecutive row numbers in partition order,
    # which lets the subsequent join line the rows up correctly
    schema_new = df.schema.add("index", LongType(), False)
    return df.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)

pca_df_index = zipindexdf(pca_df)
train_w_pca_index = zipindexdf(train_w_pca)
train_w_pca_index.show(4)

+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|index|
+-----+-----+-----+-----+-----+-----+
|   13|   15|    6|   13|    2|    0|
|   11|   10|   10|   18|   12|    1|
|   13|   11|   15|   16|   12|    2|
|    6|    2|    9|    1|   14|    3|
+-----+-----+-----+-----+-----+-----+


# join both the df
df_new = train_w_pca_index.join(pca_df_index, "index", "inner")

# drop index column
df_new = df_new.drop('index')
df_new.show(4)

+-----+-----+-----+-----+-----+-------------------+--------------------+--------------------+-------------------+--------------------+
|col_1|col_2|col_3|col_4|col_5|                 _1|                  _2|                  _3|                 _4|                  _5|
+-----+-----+-----+-----+-----+-------------------+--------------------+--------------------+-------------------+--------------------+
|    3|   19|   14|   17|   14|-1.9629750431606783|  0.9103384408550863|   2.439315065646054| 0.7714728122568422|  1.3015345479237306|
|   11|    5|   12|   10|   13| 1.1028687243351978|  0.9414552974121673| -0.3333432052612606|0.17650933263052482| -0.4960485125995481|
|   11|   14|    9|   10|   13|-0.7282369753377401|-0.03908539244478759|-0.38809447002972186|0.34112713979078885|   0.737794945616174|
|    7|    8|    8|   12|   15| 0.1241090961643788|  0.6517788602347092|  0.0936343345345299| 1.5732407177172711|-0.28060308166314357|
+-----+-----+-----+-----+-----+-------------------+--------------------+--------------------+-------------------+--------------------+

Is there a better way to do this? Many of the steps, like adding an index and performing a join, feel unnecessary (the join also shuffles the rows, which is why they appear in a different order above). Is there a cleaner, less painful way to get the same result?
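
For what it's worth, one possible cleaner direction would be to expand the vector column in place with a udf instead of going through the RDD and a join. This is a rough, untested sketch: it assumes train_w_pca still has the pca_otpt_features column (i.e. it runs before the drop above), and the pc_1 ... pc_5 names are just illustrative.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# turn the ml DenseVector into a plain array column
to_array = F.udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

# keep the original columns and pull one array element per component
arr = to_array(F.col(pca_otpt_features))
train_w_pca_cols = train_w_pca.select(
    *col_names,
    *[arr[i].alias('pc_' + str(i + 1)) for i in range(5)]
)
train_w_pca_cols.show(4)

(Newer Spark versions, 3.0 and up, also ship pyspark.ml.functions.vector_to_array, which does the DenseVector-to-array conversion without a udf.)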

0 Answers:

There are no answers yet.