Question

I want to use the cforest function from the party package to measure feature importance.

My output variable has roughly 2000 samples in class 0 and 100 samples in class 1.

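I think a good way to avoid bias caused by the class imbalance is to train each tree of the forest on a subsample in which the number of class 1 elements equals the number of class 0 elements.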

Is there a way to do that? I am thinking of an option like n_samples = c(20, 20).

Edit: code example

    > iris.cf <- cforest(Species ~ ., data = iris,
    +                    control = cforest_unbiased(mtry = 2)) # <--- here I would like to train the forest on a balanced subsample of the data
    > varimp(object = iris.cf)
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width
     0.048981818  0.002254545  0.305818182  0.271163636

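Edit: maybe my question was not clear enough. A random forest is an ensemble of decision trees. Usually, each decision tree is built from only a random subsample of the data. I would like the subsample used for each tree to contain the same number of elements from class 1 and class 0.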

Edit: the feature I am looking for is definitely available in the randomForest package:

sampsize    
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
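
For reference, here is a minimal sketch of how that option is typically used in randomForest, so it is clear what behaviour is being asked for. It assumes the randomForest package is installed (the code is skipped otherwise); the value of 20 per class is only the illustrative number mentioned above, and iris stands in for the real data.

```r
# Sketch only: runs if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  iris.rf <- randomForest::randomForest(
    Species ~ ., data = iris,
    strata = iris$Species,       # stratify the per-tree sample by class
    sampsize = c(20, 20, 20),    # draw 20 rows from each of the 3 classes per tree
    importance = TRUE            # also compute permutation importance
  )
  print(randomForest::importance(iris.rf))
}
```

With two classes, sampsize would have two entries, e.g. c(20, 20).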

I need the same thing for the party package. Is there a way to get it?

Answer 1

I will assume you know what you want to accomplish, but don't know enough R to do it.

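I am not sure whether the function offers class balancing as a parameter, but you can do it manually. Below is some code I quickly threw together. A more elegant solution may exist.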

library(party)  # provides cforest(), cforest_unbiased() and varimp()

# work on a copy, just in case
myData <- iris
# repeat everything *10* times; replicate() is just "loop 10 times and collect the results"
replicate(10,
    {
        # split the dataset by class into a list of data frames
        splitList <- split(myData, myData$Species)
        # sample *20* random rows from each data frame in the list
        sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
        # combine the sampled rows into one data.frame
        sampledData <- do.call(rbind, sampledList)

        # your code below
        res.cf <- cforest(Species ~ ., data = sampledData,
                          control = cforest_unbiased(mtry = 2)
                          )
        varimp(object = res.cf)
    }
)

Hopefully you can take it from here.
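
For the imbalanced case in the question (about 2000 rows of class 0 and 100 of class 1), the sampling step could be wrapped in a small helper. This is only a sketch in base R: balanced_subsample is a hypothetical name, not a function from party, and the toy data frame merely mimics the class counts described in the question.

```r
# Hypothetical helper (not part of party): draw the same number of rows
# from every class; by default, as many as the smallest class has.
balanced_subsample <- function(data, class_col, n_per_class = NULL) {
  # row indices grouped by class label
  idx_by_class <- split(seq_len(nrow(data)), data[[class_col]], drop = TRUE)
  if (is.null(n_per_class)) {
    n_per_class <- min(lengths(idx_by_class))
  }
  # sample positions within each group, without replacement
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_per_class)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data with the class counts from the question (assumed, for illustration)
set.seed(42)
toy <- data.frame(
  x = rnorm(2100),
  y = factor(rep(c(0, 1), times = c(2000, 100)))
)
sub <- balanced_subsample(toy, "y")
print(table(sub$y))
```

The result can then be passed as the data argument to cforest inside the replicate() loop, and the importances averaged over the repetitions. Note that this balances the data per forest, not per tree.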
