I have a question about the summary() method in SparkR when using random forest regression. Building the model works fine, but I am interested in the featureImportances that the algorithm produces. I would like to store the featureImportances in a SparkDataFrame so I can visualize them, but I do not know how to extract/transfer them.
model <- spark.randomForest(x1, x2, x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees = 50, impurity = "variance", featureSubsetStrategy = "all")
summaryRF <- summary(model)
summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'
summaryRF$featureImportances:
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'
Is there any way to get the featureImportances values out of this object and store them in a SparkDataFrame?
Using the collect() method gives the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'collect' for signature '"character"'
Answer (score: 1)
summaryRF is no longer a SparkDataFrame, which is why collect does not work :)

summaryRF$featureImportances is a character string (on the Spark side it is a SparseVector, which currently (version 2.1.0) cannot be serialized to R, which I guess is why it gets coerced to a string).

As far as I can tell, you have to extract the relevant bits by manipulating the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[", "", summaryRF$featureImportances), "\\],\\[")
# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))
# it is now a list of lists, but you can turn it into a data frame if you like:
fimpDF <- as.data.frame(do.call(cbind, fimp[[1]]))
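To make the string surgery concrete, here is the same two-step pipeline run on the literal string from the question (a sketch using only base R; the intermediate values in the comments are what these calls produce):

# The SparseVector prints as "(size,[indices],[values])":
s <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"

# Step 1: drop the leading "(3,[" and split at "],[":
fimpList <- strsplit(gsub("\\(.*?\\[", "", s), "\\],\\[")
# fimpList[[1]] is now c("0,1,2", "0.01324152135,0.0545454422,0.0322122334])")

# Step 2: strip the trailing "])" and split each piece at ",":
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))
# fimp[[1]][[1]] holds the indexes ("0", "1", "2"),
# fimp[[1]][[2]] holds the importances, still as character strings.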
eta: by the way, the indexing in summaryRF$featureImportances starts from 0, so you have to take that into account when joining on the feature names in summaryRF$features:

featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = c(0:(length(summaryRF$features) - 1)),
                                  stringsAsFactors = FALSE)
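To close the loop on the original question (getting the importances into a SparkDataFrame for visualization), a minimal sketch building on the fimp and featureNameAndIndex objects above; the column names and the final createDataFrame() call (SparkR 2.x, requires an active Spark session) are my own additions, not part of the answer:

# coerce the two string vectors into typed columns:
importanceDF <- data.frame(featureIndex = as.integer(fimp[[1]][[1]]),
                           importance   = as.numeric(fimp[[1]][[2]]),
                           stringsAsFactors = FALSE)

# join the zero-based indexes onto the feature names:
plotDF <- merge(featureNameAndIndex, importanceDF, by = "featureIndex")

# if a SparkDataFrame is really needed, push the small local frame back to Spark:
sdf <- createDataFrame(plotDF)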