Imbalanced dataset error when using the ovun.sample function

Time: 2018-07-03 07:14:09

Tags: r

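The dataset I am working with is imbalanced, so I am trying to balance it by undersampling, but I get an error. This is the error I get: Error in function (formula, data, method, subset, na.action, N, P = 0.5, : The response variable has only one class. How can I resolve this error?

What I have tried: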

library(ROSE)

data_frame <- click.csv

data_frame2 <- buy.csv

> colnames(data_frame)
[1] "SessionID" "Timestamp" "ItemID" "Category"

> colnames(data_frame2)
[1] "SessionID" "Timestamp" "ItemID" "Price" "Quantity"

> mydata <- merge(x = data_frame, y = data_frame2, by = "SessionID", all.x = TRUE, allow.cartesian = TRUE)  # left outer join
> mydata
          SessionID              Timestamp.x  ItemID.x Category Timestamp.y ItemID.y Price Quantity
       1:         1 2014-04-07T10:51:09.277Z 214536502        0        <NA>     <NA>  <NA>     <NA>
       2:         1 2014-04-07T10:54:09.868Z 214536500        0        <NA>     <NA>  <NA>     <NA>
       3:         1 2014-04-07T10:54:46.998Z 214536506        0        <NA>     <NA>  <NA>     <NA>
       4:         1 2014-04-07T10:57:00.306Z 214577561        0        <NA>     <NA>  <NA>     <NA>
       5:  10000001 2014-09-08T10:35:38.841Z 214854230        S        <NA>     <NA>  <NA>     <NA>
      ---
40596049:   9999997 2014-09-07T18:12:46.466Z 214854159        S        <NA>     <NA>  <NA>     <NA>
40596050:   9999997 2014-09-07T18:13:04.315Z 214643036        S        <NA>     <NA>  <NA>     <NA>
40596051:   9999997 2014-09-07T18:14:47.365Z 214854159        S        <NA>     <NA>  <NA>     <NA>
40596052:   9999998 2014-09-07T20:53:43.120Z 214541597        0        <NA>     <NA>  <NA>     <NA>
40596053:   9999999 2014-09-04T04:44:46.942Z 214644650        S        <NA>     <NA>  <NA>     <NA>
mydata$ItemID.y[!is.na(mydata$ItemID.y)] <- 1
mydata$ItemID.y[is.na(mydata$ItemID.y)] <- 0
table(mydata$ItemID.y)
       0        1
29698257 10897796
str(mydata)
Classes ‘data.table’ and 'data.frame': 40596053 obs. of 8 variables:
 $ SessionID  : Factor w/ 9249729 levels "1","10000001",..: 1 1 1 1 2 2 2 2 2 3 ...
 $ Timestamp.x: Factor w/ 32937845 levels "2014-04-01T03:00:00.124Z",..: 1406509 1407501 1407712 1408409 29083768 29085345 29085440 29085649 29088238 29247009 ...
 $ ItemID.x   : Factor w/ 52739 levels "1178793047","1178794001",..: 2083 2082 2084 9906 50230 6411 6410 50230 50187 48852 ...
 $ Category   : Factor w/ 339 levels "0","1","10","11",..: 1 1 1 1 339 339 339 339 339 339 ...
 $ Timestamp.y: Factor w/ 1136477 levels "2014-04-01T03:05:31.743Z",..: NA NA NA NA NA NA NA NA NA NA ...
 $ ItemID.y   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price      : Factor w/ 735 levels "0","10052","1015",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Quantity   : Factor w/ 28 levels "0","1","10","11",..: NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr>
data_balanced_over <- ovun.sample(ItemID.y ~ ., data = mydata, method = "over", N = 800)
Error in function (formula, data, method, subset, na.action, N, P=0.5, :
The response variable has only one class.

1 Answer:

Answer 0: (score: 0)

Since the example is not reproducible, I suggest an alternative approach using a different function:

library(caret)
# x: data frame of predictors (every column except the response)
x <- as.data.frame(mydata)[, setdiff(names(mydata), "ItemID.y")]
# y: the response variable as a factor
y <- as.factor(mydata$ItemID.y)
# downSample() randomly samples the data so that all classes have the same frequency as the minority class
balanced <- downSample(x, y, yname = "ItemID.y")
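
For readers who want to see the idea without installing caret, here is a rough base-R sketch of random undersampling to the minority-class size. It is not caret's implementation; the toy data frame and its columns x and y are hypothetical and stand in for the real predictors and response.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical toy data: 90 rows of class "0" and 10 rows of class "1".
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("0", "1"), times = c(90, 10)))
)

# Size of the smallest class.
n_min <- min(table(toy$y))

# For each class, draw n_min row indices without replacement.
keep <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$y),
  function(idx) idx[sample.int(length(idx), n_min)]
))

# Keep only the sampled rows; every class now has n_min rows.
balanced <- toy[keep, ]
print(table(balanced$y))
```

Sampling positions with sample.int() rather than calling sample(idx, n_min) directly avoids the surprise that sample() treats a single number as a range to draw from.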

For a working example, see the documentation for downSample.
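
A possible explanation for the original error, offered as a hedged sketch and not a confirmed diagnosis: the formula ItemID.y ~ . uses every other column as a predictor, and Timestamp.y, Price and Quantity are NA in every row without a purchase. Assuming ovun.sample builds its model frame with R's default na.action (na.omit), all class-0 rows would be dropped before sampling, leaving a single class. The base-R snippet below reproduces that effect on hypothetical toy data (the toy data frame and its bought column are made up for illustration).

```r
# Hypothetical toy data mimicking the merged table: the buy-side column
# Price is NA whenever there was no purchase (label 0).
toy <- data.frame(
  Category = factor(c("0", "0", "S", "S", "0", "S")),
  Price    = c(NA, NA, NA, NA, 523, 1046),
  bought   = factor(c(0, 0, 0, 0, 1, 1))
)

# With `bought ~ .`, Price is a predictor, so na.omit removes every row
# where Price is NA, which here is every label-0 row.
mf_all <- model.frame(bought ~ ., data = toy, na.action = na.omit)
classes_all <- table(droplevels(mf_all$bought))
print(classes_all)

# Restricting the formula to columns without structural NAs keeps both classes.
mf_ok <- model.frame(bought ~ Category, data = toy, na.action = na.omit)
classes_ok <- table(mf_ok$bought)
print(classes_ok)
```

If this is indeed the cause, restricting the formula to columns that are not structurally NA (for example ItemID.y ~ Category), or dropping the buy-side columns before the call, should keep both classes. Note also that the question asks for undersampling, which in ROSE corresponds to method = "under" and not method = "over".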