I'm creating a decision tree with the R rpart package based on x number of variables and a dataframe:
fit<-rpart(y~x1+x2+x3+x4,data=(mydataframe),
control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))
But instead of using the entire dataframe, I have four or five subsets of data that are factors, let's say separated out by x4. How can I run decision trees on all of these factors at once instead of having to call subsets of the data again and again?
Based on a search of SO, it looks like either by() or ddply might be the right choice. Here's what I've tried for ddply:
fit<-ddply(mydataframe, dataframe$x4, function (df)
rpart(y~x1+x2+x3+x4,data=(df),
control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
but what I'm getting back is:
Error in eval(expr, envir, enclos) : object 'x4value' not found
where x4value is one of the variable values I'd like to split out by. So I have a column of values:
x4
BucketName1
BucketName2
BucketName3
BucketName4
str(mydataframe) shows that $x4 is a Factor w/ 8 levels and no symbols.
Additionally, I ran mydataframe = na.omit(dataframe) at the very beginning to avoid nulls.
Possible issues I've already troubleshooted:
The rpart bit runs fine when I run it manually as such:
mydataframe<-subset(trainData, x4=="BucketName1")
fit<-rpart(y~x1+x2+x3+x4,data=(mydataframe),
control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))
but borks whenever I try to loop through all subsets using ddply.
Complete reproducible sample code:
mydataframe<-data.frame ( x1=sample(1:10),
x2=sample(1:10),
x3=sample(1:10),
x4= sample(letters[1:4], 20, replace = TRUE))
str(mydataframe)
fit<-ddply(mydataframe, mydataframe$x4, function (df)
rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
Output:
> str(mydataframe)
'data.frame': 20 obs. of 4 variables:
 $ x1: int 1 6 8 4 7 9 3 2 10 5 ...
 $ x2: int 9 4 5 8 6 3 7 10 2 1 ...
 $ x3: int 2 6 5 3 1 4 9 7 10 8 ...
 $ x4: Factor w/ 4 levels "a","b","c","d": 4 4 3 2 3 4 3 3 1 3 ...
> fit<-ddply(mydataframe, mydataframe$x4, function (df) rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
Error in eval(expr, envir, enclos) : object 'd' not found
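Answer 0 (score: 1)
You want to do two things with your code:
Use dlply instead of ddply, because you want a list of rpart objects rather than a data frame. ddply would be useful if you wanted to show the predicted values for the original data, since those can be formatted as a data frame.
Use .(x4) instead of dataframe$x4 in the dlply call. Using the latter will produce unpredictable results.
Also, in your example you should define a y value and remove x4 from the model formula, since x4 is constant within each subset.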
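A minimal sketch of both suggestions, assuming the plyr and rpart packages are installed. The response y is invented here because the question's sample data has none, the sample is enlarged to 200 rows so each subset can actually split, and minbucket is left at its default. None of these choices come from the original post.

```r
library(plyr)
library(rpart)

set.seed(1)
# Hypothetical data: same columns as the question, plus an invented y
mydataframe <- data.frame(
  x1 = sample(1:10, 200, replace = TRUE),
  x2 = sample(1:10, 200, replace = TRUE),
  x3 = sample(1:10, 200, replace = TRUE),
  x4 = factor(sample(letters[1:4], 200, replace = TRUE))
)
mydataframe$y <- mydataframe$x1 + rnorm(200)

# Assumption: minbucket is left at its default rather than set to 0
ctrl <- rpart.control(minsplit = 20, cp = 0.01)

# dlply returns a list holding one rpart fit per level of x4
fits <- dlply(mydataframe, .(x4), function(df)
  rpart(y ~ x1 + x2 + x3, data = df, control = ctrl))
names(fits)

# ddply fits when each group's result is itself a data frame,
# for example the subset with its predicted values attached
preds <- ddply(mydataframe, .(x4), function(df) {
  fit <- rpart(y ~ x1 + x2 + x3, data = df, control = ctrl)
  data.frame(df, predicted = predict(fit, df))
})
head(preds)
```

The list form keeps the fitted trees available for later use (printing, pruning, predicting on new data), while the data frame form is convenient when only the predictions are needed.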
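Answer 1 (score: 0)
You are passing the wrong value to the .variables= argument of ddply(). You should pass quoted variable names, a formula, or a character vector of variable names. Because you are passing mydataframe$x4, it is coerced to character, and ddply then looks up every value in that column as if it were a variable name.
Here's what the call should look like: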
# dlply rather than ddply, because each result is an rpart object, not a data frame
fit<-dlply(mydataframe, ~x4, function (df)
rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
or
fit<-dlply(mydataframe, .(x4), function (df)
rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
or
fit<-dlply(mydataframe, "x4", function (df)
rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
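The failure described in this answer can be imitated in base R, without plyr. The sketch below is only an illustration of the mechanism, not plyr's actual code: it evaluates a column name inside the data frame, then evaluates a column value as if it were a name. The tiny data frame is hypothetical.

```r
# Hypothetical miniature of the question's data
mydataframe <- data.frame(
  x1 = 1:4,
  x4 = factor(c("d", "a", "b", "d"))
)

# Evaluating the column name inside the data frame works: x4 is a column
grouping <- eval(quote(x4), mydataframe)
grouping

# Evaluating a column *value* as a name fails: there is no column called d
msg <- tryCatch(
  eval(as.name(as.character(mydataframe$x4[1])), mydataframe),
  error = function(e) conditionMessage(e)
)
msg
```

This also suggests why the results can be unpredictable rather than a clean error: a value such as "c" happens to match an existing R object (the function c), so the lookup would succeed and return something unrelated to the data.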
Answer 2 (score: 0)
If you are not familiar with plyr, you can also do this with base R functions.
splitData = split(mydataframe, mydataframe$x4)
getModel = function(df) {
fit <- rpart(y~x1+x2+x3+x4, data=df,  # extend the formula with any further predictors
control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))
return(fit)
}
models = lapply(splitData, getModel)
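Once the models are in a list, they can be looked up by factor level and applied back to their own subsets. The following is a self-contained sketch under stated assumptions: the response y and the 200-row sample are invented for illustration, x4 is dropped from the formula because it is constant within each subset, and minbucket is left at its default.

```r
library(rpart)

set.seed(42)
n <- 200
# Hypothetical data with an invented response y
mydataframe <- data.frame(
  x1 = sample(1:10, n, replace = TRUE),
  x2 = sample(1:10, n, replace = TRUE),
  x3 = sample(1:10, n, replace = TRUE),
  x4 = factor(sample(letters[1:4], n, replace = TRUE))
)
mydataframe$y <- 2 * mydataframe$x2 + rnorm(n)

splitData <- split(mydataframe, mydataframe$x4)
models <- lapply(splitData, function(df)
  rpart(y ~ x1 + x2 + x3, data = df,
        control = rpart.control(minsplit = 20, cp = 0.01)))

# Each model is retrieved by the factor level it was fitted on
models[["a"]]

# Number of terminal nodes in each subset's tree
leaves <- sapply(models, function(m) sum(m$frame$var == "<leaf>"))
leaves

# Predict every subset with its own model, then restore the original row order
predicted <- unsplit(
  Map(function(m, df) predict(m, df), models, splitData),
  mydataframe$x4
)
head(predicted)
```

Because split() names the list by the levels of x4, the models and the subsets line up by position, which is what lets Map() pair them and unsplit() put the predictions back in the original row order.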
You can also do this with dplyr instead of plyr.
mydataframe %>% group_by(x4) %>%
do(model = getModel(.))