Mapping PySpark random forest feature importances back to column names after column transformations

Time: 2018-06-19 22:08:48

Tags: apache-spark pyspark apache-spark-sql apache-spark-mllib

I am trying to plot the feature importances of certain tree-based models with the column names. I am using PySpark.

Since I have text categorical variables as well as numeric ones, I had to use a pipeline approach something like this -

  1. Use a StringIndexer to index the string columns
  2. Use a OneHotEncoder on all of those columns
  3. Use a VectorAssembler to create the feature column containing the feature vector

    Some sample code from the docs for steps 1, 2, and 3 -

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                          "relationship", "race", "sex", "native_country"]
    stages = []  # stages in our Pipeline
    for categoricalCol in categoricalColumns:
        # Category indexing with StringIndexer
        stringIndexer = StringIndexer(inputCol=categoricalCol,
                                      outputCol=categoricalCol + "Index")
        # Use OneHotEncoderEstimator to convert categorical variables into binary SparseVectors
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                         outputCols=[categoricalCol + "classVec"])
        # Add stages. These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
                   "capital_loss", "hours_per_week"]
    assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler]

    # Create a Pipeline.
    pipeline = Pipeline(stages=stages)
    # Run the feature transformations.
    #  - fit() computes feature statistics as needed.
    #  - transform() actually transforms the features.
    pipelineModel = pipeline.fit(dataset)
    dataset = pipelineModel.transform(dataset)
    
  4. Finally train the model
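    A minimal sketch of this step, assuming a RandomForestClassifier, an 80/20 split, and a label column named "label" (all illustrative; the original post does not show its training code):

    from pyspark.ml.classification import RandomForestClassifier

    # Hypothetical continuation: the classifier choice, split, seed,
    # and "label" column name are assumptions, not from the original post.
    train, test = dataset.randomSplit([0.8, 0.2], seed=42)
    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
    dtModel_1 = rf.fit(train)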

    After training and evaluation, I can use "model.featureImportances" to get the feature rankings, but I don't get the feature/column names, only the feature numbers, like this -

    print(dtModel_1.featureImportances)
    
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
    

How do I map these back to the initial column names and values, so that I can plot them?

3 Answers:

Answer 0 (Score: 6)

Extract the metadata as shown here by user6910411:

    from itertools import chain

    attrs = sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*dataset
            .schema["features"]
            .metadata["ml_attr"]["attrs"].values()))

and combine it with the feature importances:

    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
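
Since the stated goal is plotting, here is a short continuation sketch (assuming matplotlib is available, with `attrs` and `dtModel_1` as above) that turns these (name, importance) pairs into a horizontal bar chart:

    import matplotlib.pyplot as plt

    # Keep only the features that received a non-zero importance, sorted ascending.
    pairs = sorted(
        ((name, dtModel_1.featureImportances[idx]) for idx, name in attrs
         if dtModel_1.featureImportances[idx]),
        key=lambda p: p[1])
    names, importances = zip(*pairs)

    plt.barh(range(len(names)), importances)
    plt.yticks(range(len(names)), names)
    plt.xlabel("importance")
    plt.tight_layout()
    plt.show()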

Answer 1 (Score: 1)

The transformed dataset's metadata has the required attributes. Here is an easy way to do this -

  1. Create a pandas DataFrame (usually the feature list will not be huge, so there are no memory issues in storing a pandas DF)

    import pandas as pd

    pandasDF = pd.DataFrame(
        dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
        + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
    ).sort_values("idx")
    
  2. Then create a broadcast dictionary to map against. Broadcasting is a must in a distributed environment.

    feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))

    feature_dict_broad = sc.broadcast(feature_dict)
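
As a rough usage sketch of the mapping (assuming `dtModel_1` from the question, whose importances come back as a SparseVector):

    importances = dtModel_1.featureImportances
    # SparseVector exposes its non-zero slots via .indices; map each back to a name.
    named_importances = {feature_dict_broad.value[int(idx)]: importances[int(idx)]
                         for idx in importances.indices}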
    

Answer 2 (Score: 0)

When you created the assembler, you used a list of variables (assemblerInputs), and the order is preserved in the "features" variable. So just make a pandas DataFrame:

    import pandas as pd

    features_imp_pd = (
        pd.DataFrame(
            dtModel_1.featureImportances.toArray(),
            index=assemblerInputs,
            columns=['importance'])
    )
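
One hedge on this approach: the index alignment only holds when every entry of assemblerInputs occupies exactly one slot in the feature vector. After one-hot encoding, as in this question (vector length 38895 above), the vector is far wider than assemblerInputs, so the metadata-based answers above apply instead. Where the lengths do match, a short continuation to sort and plot (assuming matplotlib is installed):

    # Sort by importance and draw a horizontal bar chart.
    (features_imp_pd
        .sort_values('importance')
        .plot.barh(legend=False))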