df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0)
> df
id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
10 0.0 0.0 0.1 0.00 0.00 0.4 0
2 0.2 0.0 0.0 0.22 0.00 0.0 0
4 0.5 0.0 0.0 0.35 0.00 0.0 0
5 0.8 0.0 0.0 0.80 0.00 0.0 0
6 0.0 0.9 0.0 0.00 0.70 0.0 0
7 0.0 0.3 0.0 0.00 0.30 0.0 1
8 0.0 0.0 0.2 0.00 0.00 0.4 1
9 0.0 0.0 0.8 0.00 0.00 0.9 0
A1 0.0 0.0 0.1 0.00 0.00 0.2 0
B3 0.0 0.3 0.0 0.00 0.23 0.0 0
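I am using the ROSE function to generate samples for imbalanced data. However, I want to keep the id of each observation in df after ROSE. After running ROSE, I get the output below.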
df.rose <- ROSE(class ~ ., data=df, seed=123,N=20,p=0.25)$data
> df.rose
id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
B3 -0.24636399 0.513435064 -0.0844105623 0.04695640 0.419960189 0.08112992 0
9 -0.05029030 0.199689698 0.7022285344 0.08255245 -0.133951228 1.16820765 0
9 -0.23671562 0.167377715 0.9634146745 -0.10923003 -0.129948534 1.00641398 0
B3 -0.16816685 0.434632663 -0.0174671002 -0.07245581 0.423706144 -0.07969934 0
9 -0.14420654 -0.015047974 0.8530741203 -0.22148879 -0.053786877 1.18091542 0
9 -0.38914709 -0.074365870 0.7940190162 -0.23306056 -0.230564666 1.14293933 0
6 0.19329086 0.807524478 -0.0089820194 0.06600218 0.734243934 0.13409831 0
6 0.03538563 0.731147735 0.2867432037 0.09746303 0.673766711 0.05837655 0
4 0.23741363 -0.050535412 -0.0473024899 0.36152575 0.001088718 -0.15354050 0
2 0.48927513 -0.307561385 0.3177238885 0.42054668 0.072770343 0.33271737 0
B3 0.09839211 0.827176406 -0.3244875053 0.44579006 0.159991098 -0.14678016 0
B3 -0.06807770 0.593601657 0.1224855617 -0.10677452 0.351707470 0.53486376 0
9 0.20651979 -0.272977578 0.8259493668 -0.50212781 -0.041644690 1.27476593 0
8 0.00000000 -0.008315345 0.0008152742 0.00000000 0.043469230 0.29596908 1
7 0.00000000 0.155050387 -0.0068404803 0.00000000 0.314397160 -0.50556877 1
7 0.00000000 -0.008021610 0.0639465277 0.00000000 0.122372337 0.27856790 1
8 0.00000000 -0.070217063 0.2370763279 0.00000000 -0.013168583 0.04034823 1
7 0.00000000 0.469712631 0.0130102656 0.00000000 0.566767608 0.18219645 1
7 0.00000000 0.193749720 -0.0788801623 0.00000000 0.383380004 0.47007644 1
7 0.00000000 0.412273782 -0.1046108759 0.00000000 0.307614552 -0.35552820 1
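After ROSE, not all of the ids are present in the result, and I want to keep every one of them. Does anyone know another way to handle imbalanced data while preserving the id of each observation? I do not want the ids to get mixed up. I have tried oversampling, undersampling, and SMOTE, but none gave good results. I also tried converting the id column to a factor, but that did not work.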
Answer 0 (score: 1)
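In case anyone is still wondering, this is the approach I ended up using. I only needed the new synthetic observations, but SMOTE kept shrinking the size of the dataset. Hope it helps: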
library(DMwR)
library(dplyr)
# df - dataframe you want to use over/undersampling on
df$ID <- seq.int(nrow(df))
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
sub_df <- subset(df_smote, var == "yes")
final_df <- rbind(df, sub_df)
final_df <- distinct(final_df)
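- Create an ID column so that rows can be compared for exact equality (rather than being observations that merely share the same characteristics).
- Run SMOTE with the desired parameters (where var is the binary variable on which your data is imbalanced).
- Subset the synthetic observations by a given level of var; in this case it is the "yes" level.
- Row-bind the subset to the original dataset.
- Remove the duplicates introduced by SMOTE.
- You end up with the original dataset plus synthetic observations only for the level you wanted to over/undersample.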
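The steps above can be sketched in base R without DMwR. This is only a minimal illustration under stated assumptions, not the answer's exact code: `make_synthetic()` is a hypothetical stand-in for SMOTE that interpolates between random pairs of minority rows, and the toy data frame with its column names (`x1`, `x2`, `var`) is invented for the example. Unlike the answer's code, the ID column is kept out of the interpolation, so original rows keep their ID and synthetic rows are marked with `NA`.

```r
set.seed(123)

# Toy imbalanced data (assumed for illustration): 6 "no" rows, 2 "yes" rows.
df <- data.frame(
  x1  = c(0.1, 0.2, 0.5, 0.8, 0.3, 0.9, 0.4, 0.7),
  x2  = c(0.0, 0.3, 0.1, 0.6, 0.2, 0.7, 0.5, 0.9),
  var = factor(c("no", "no", "no", "no", "no", "no", "yes", "yes"))
)

# Step 1: give every original row an ID.
df$ID <- seq_len(nrow(df))

# Hypothetical stand-in for SMOTE (not the DMwR implementation):
# each synthetic row lies on the segment between two randomly chosen
# minority rows. Assumes the class column is named `var`.
make_synthetic <- function(data, level, n, features) {
  minority <- data[data$var == level, , drop = FALSE]
  a <- as.matrix(minority[sample(nrow(minority), n, replace = TRUE), features, drop = FALSE])
  b <- as.matrix(minority[sample(nrow(minority), n, replace = TRUE), features, drop = FALSE])
  w <- runif(n)                       # one interpolation weight per synthetic row
  out <- as.data.frame(a + w * (b - a))
  rownames(out) <- NULL
  out$var <- factor(level, levels = levels(data$var))
  out$ID  <- NA_integer_              # synthetic rows carry no original ID
  out
}

# Steps 2-3: generate synthetic rows for the "yes" level only.
synthetic <- make_synthetic(df, level = "yes", n = 4, features = c("x1", "x2"))

# Step 4: row-bind the synthetic rows to the untouched original data.
final_df <- rbind(df, synthetic[names(df)])

# Step 5: drop exact duplicate rows. With this stand-in there should be none;
# the call mirrors the distinct() step in the answer.
final_df <- unique(final_df)

# Original rows keep their IDs; synthetic rows are the ones with ID == NA.
final_df$synthetic <- is.na(final_df$ID)
```

Keeping the ID out of the feature set is a design choice for this sketch: it means the original ids are never altered, and the `synthetic` flag makes it easy to separate real observations from generated ones later.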