SparkR summary() extraction

Date: 2017-05-03 08:15:00

Tags: r apache-spark sparkr

I have a question about the summary() method in SparkR, used with random forest regression. The model building process works fine, but I am interested in the featureImportances element of the algorithm's output. I would like to store the featureImportances variable in a SparkDataFrame so I can visualize it, but I don't know how to transfer/extract it.

model <- spark.randomForest(x1, x2, x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees = 50, impurity = "variance", featureSubsetStrategy = "all")
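For reference, SparkR's documented spark.randomForest signature takes a SparkDataFrame and an R formula; a minimal sketch of that call shape, where localDF, df, and y are illustrative placeholders rather than names from the question:

# illustrative placeholders: localDF is a local R data.frame with columns y, x1, x2, x3
df <- createDataFrame(localDF)
model <- spark.randomForest(df, y ~ x1 + x2 + x3, type = "regression",
                            maxDepth = 30, maxBins = 50, numTrees = 50,
                            impurity = "variance", featureSubsetStrategy = "all")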

summaryRF <- summary(model)

summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'

summaryRF$featureImportances:
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'

Is there any solution to get the featureImportances values out of the list object and store them in a SparkDataFrame?

Using the collect() method gives the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'collect' for signature '"character"'

1 Answer:

Answer 0 (score: 1)

summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)
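A quick check confirms this (a minimal sketch, assuming the summaryRF object from the question):

# summary() returns a local R object, so there is nothing left to collect():
is(summaryRF, "SparkDataFrame")       # FALSE
class(summaryRF$featureImportances)   # "character"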

summaryRF$featureImportances is a character string: on the Spark side it is a SparseVector, which currently (version 2.1.0) cannot be serialized into R, which I'm guessing is why it gets coerced to a string (the text form is (size,[indices],[values])). As far as I can tell, you have to extract the relevant bits by operating on the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[", "", summaryRF$featureImportances), "\\],\\[")

# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))

# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind, fimp[[1]]))
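To land this in a SparkDataFrame for visualization, as the question asks, one possible continuation (a sketch assuming the fimpDF built above; the columns arrive as whatever names and string types as.data.frame assigned, and createDataFrame is SparkR's local-to-Spark conversion):

# name the columns, convert them from character/factor to numeric types,
# then promote the local data.frame to a SparkDataFrame:
names(fimpDF) <- c("featureIndex", "importance")
fimpDF$featureIndex <- as.integer(as.character(fimpDF$featureIndex))
fimpDF$importance <- as.numeric(as.character(fimpDF$importance))
fimpSDF <- createDataFrame(fimpDF)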

eta: by the way, the indexes in summaryRF$featureImportances start at 0, so you have to take that into account if you want to join the feature names in summaryRF$features to the feature indexes:

featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = c(0:(length(summaryRF$features) - 1)),
                                  stringsAsFactors = FALSE)
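A possible last step, not from the answer itself: joining the name/index table onto the importances with base R's merge(), assuming the cleaned fimpDF from the sketch above:

# join feature names onto importances via the shared 0-based featureIndex;
# the result has columns featureIndex, featureName, importance
namedImportances <- merge(featureNameAndIndex, fimpDF, by = "featureIndex")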
