I'm running a model using GLM (using ML in Spark 2.0) on data with one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with my continuous independent variable into a column of sparse vectors.

If my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
So far so good, and I run the model:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients
which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2871.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients back to the original column names, and I need to (I've simplified this model for SO; there's more going on in the real one).
The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
which gives me a small dataframe relating the string names to the numeric names, which I think I can then relate back to the keys in the sparse vector? This is very clunky and slow, though, once you consider the scale of the data.
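Spelled out, that workaround looks something like this (a minimal sketch, assuming the column names above; collecting the distinct pairs is the step that scales poorly):

# Sketch of the slow workaround: collect the distinct (string, index)
# pairs and build an index -> name lookup on the driver
index_to_name = {
    int(row['categorical_index']): row['categorical']
    for row in df.select('categorical', 'categorical_index').distinct().collect()
}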
Is there a better way?
Answer 0 (score: 2)
I ran into this exact problem too, and here's my solution :)
It's based on the Scala version: How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)

# extract features metadata
meta = [f.metadata
        for f in best_pred.schema.fields
        if f.name == 'features'][0]

# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
                    meta['ml_attr']['attrs']['binary']

print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
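From there, lining the entries up with the coefficient vector is one sort and one zip (a sketch, assuming a fitted GeneralizedLinearRegressionModel named model, as in the question):

# Sketch: pair each feature name with its coefficient by vector index
for entry, coef in zip(sorted(features_name_ind, key=lambda d: d['idx']),
                       model.coefficients):
    print(entry['name'], coef)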
Answer 1 (score: 0)
Sorry, this may seem like a very late answer, and maybe you've already figured it out, but here goes anyway. I recently implemented the same combination of StringIndexer, OneHotEncoder and VectorAssembler, and as far as I understand, the following code will produce what you're looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")
    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]

# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline for training
pipeline = Pipeline(stages=stages)

# Run the feature transformations
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
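This sets up the transformations but not the name recovery itself; the metadata trick from the other answers applies to the "features" column it produces (a sketch, assuming df has been transformed as above):

# Sketch: recover feature names, in vector order, from the metadata that
# VectorAssembler attaches to the "features" column
attrs = df.schema['features'].metadata['ml_attr']['attrs']
entries = sorted((e for group in attrs.values() for e in group),
                 key=lambda d: d['idx'])
print([e['name'] for e in entries])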
Answer 2 (score: 0)
For PySpark, here is a solution to map feature indices to feature names:

First, train the model:
pipeline = Pipeline().setStages([label_stringIdx, assembler, classifier])
model = pipeline.fit(x)
Transform the data:
df_output = model.transform(x)
Extract the mapping between feature indices and feature names, merging the numeric attributes and the binary attributes into a single list:
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
Output:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....
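Sorting that list by idx then puts the names in the same order as the model's coefficient vector (a short sketch building on merge_list above):

# Sketch: order the merged attribute entries by vector index so they
# line up with the coefficient vector
feature_names = [d['name'] for d in sorted(merge_list, key=lambda d: d['idx'])]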
Answer 3 (score: 0)
I haven't checked earlier versions, but in Spark 2.4.3 a great deal of information about the features can be retrieved just by using the summary attribute of a GeneralizedLinearRegressionModel.

Printing summary results in something like the following:

Coefficients:
            Feature Estimate Std Error T Value P Value
        (Intercept)  -0.1742    0.4298 -0.4053  0.6853
  x1_enc_(-inf,5.5]  -0.7781    0.3661 -2.1256  0.0335
   x1_enc_(5.5,8.5]   0.1850    0.3736  0.4953  0.6204
   x1_enc_(8.5,9.5]  -0.3937    0.4324 -0.9106  0.3625
 x45_enc_1-10-7-8-9  -0.5382    0.2718 -1.9801  0.0477
   x45_enc_2-3-4-ND   0.5187    0.2811  1.8454  0.0650
          x45_enc_5  -0.0456    0.3353 -0.1361  0.8917
          x33_enc_1   0.6361    0.4043  1.5731  0.1157
         x33_enc_10   0.0059    0.4083  0.0145  0.9884
 x33_enc_2-3-4-8-ND   0.6121    0.1741  3.5152  0.0004
x102_enc_(-inf,4.5]   0.5315    0.1695  3.1354  0.0017

(Dispersion parameter for binomial family taken to be 1.0000)

Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846

The Feature column can be constructed by accessing an internal Java object:

In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
 'x1_enc_(5.5,8.5]',
 'x1_enc_(8.5,9.5]',
 'x45_enc_1-10-7-8-9',
 'x45_enc_2-3-4-ND',
 'x45_enc_5',
 'x33_enc_1',
 'x33_enc_10',
 'x33_enc_2-3-4-8-ND',
 'x102_enc_(-inf,4.5]']

The Estimate column can be constructed by the following concatenation:

In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
 -0.7781490190325139,
 0.1850214800764976,
 -0.3936963366945294,
 -0.5382255101657534,
 0.5187453074755956,
 -0.045649677050663987,
 0.6360647167539958,
 0.00593020879299306,
 0.6121475986933201,
 0.531510974697773]

PS: This line shows why the Feature column can be retrieved by using the internal Java object.
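Putting the two together gives a name-to-estimate table (a sketch; glm here is the fitted model whose summary is shown above):

# Sketch: zip feature names with estimates, including the intercept
names = ['(Intercept)'] + glm.summary._call_java('featureNames')
estimates = [glm.intercept] + list(glm.coefficients)
for name, est in zip(names, estimates):
    print(name, est)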