Generating synthetic data for unsupervised learning

Time: 2012-08-05 22:17:52

Tags: r dataset cluster-analysis random-forest unsupervised-learning

I want to prepare data for unsupervised learning with random forests. The procedure is as follows:

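  • Take the data and add an attribute 'class' with value 1 to all examples
  • Generate synthetic data from the original data:
    • While the synthetic data has fewer examples than the original data, construct a new example:
      • Sample a new attribute value from all values of that attribute in the original data
      • Do this for every attribute and combine the values into a new example
  • Assign the attribute 'class' with value 2 to the synthetic data
  • Bind the two datasets together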

In the end it looks like this:

        ...      Class
                |1
     Original   |1
     Data       |1
                |1
    --------------
                |2
     Synthetic  |2
     Data       |2
                |2

My R code looks like this:

library(gtools) #for smartbind()

sample1 <- function(X)   { sample(X, replace=T) } 
g1      <- function(dat) { apply(dat,2,sample1) }

data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1

synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2
colnames(synthData) <- colnames(data)
newData <- smartbind(data, synthData) #bind the data together

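Obviously I'm new to R, but it works. There is just one problem: the attribute types in the synthetic data differ from those in the original data. Where they were originally nums, they now become factors. How can I keep the same types when generating the synthetic data?

Thanks!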

Data1 (nums become factors):

  

structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131),
    V3 = c(13.21, 12.89, 13.44, 14.15, 13.69),
    V4 = c(3.48, 3.62, 3.61, 0, 3.2),
    V5 = c(1.41, 1.57, 1.54, 2.09, 1.81),
    V6 = c(72.64, 72.96, 72.39, 72.74, 72.81),
    V7 = c(0.59, 0.61, 0.66, 0, 1.76),
    V8 = c(8.43, 8.11, 8.03, 10.88, 5.43),
    V9 = c(0, 0, 0, 0, 1.19),
    V10 = c(0, 0, 0, 0, 0),
    realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")),
    .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass"),
    row.names = c(27L, 138L, 77L, 183L, 186L), class = "data.frame")

Data2 (factors become chrs):

  

structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"),
    V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", "c", "f", "k", "s", "x"), class = "factor"),
    V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"),
    V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r", "u", "w", "y"), class = "factor"),
    V5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("f", "t"), class = "factor"),
    V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"),
    V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"),
    V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"),
    V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n"), class = "factor"),
    V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"),
    V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"),
    V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"),
    V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"),
    V14 = structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"),
    V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"),
    V16 = structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"),
    V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"),
    V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"),
    V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"),
    V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"),
    V21 = structure(c(8L, 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"),
    V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"),
    V23 = structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", "l", "m", "p", "u", "w"), class = "factor")),
    .Names = c("realClass", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11",
    "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20",
    "V21", "V22", "V23"), row.names = c(4105L, 6207L, 6696L, 2736L,
    3756L), class = "data.frame")

1 Answer:

Answer 0 (score: 2)

You can always use this trick to turn a factor column back into a numeric column:

numcol <- as.numeric(as.character(factcol))
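
The as.character() step matters: calling as.numeric() directly on a factor returns its internal integer level codes, not the printed values. A minimal illustration, where the contents of factcol are made up for this sketch:

```r
# Hypothetical factor whose labels are numbers stored as text
factcol <- factor(c("10.5", "3.2", "10.5", "7"))

codes <- as.numeric(factcol)                  # level codes: 1 2 1 3
numcol <- as.numeric(as.character(factcol))   # actual values: 10.5 3.2 10.5 7.0

print(codes)
print(numcol)
```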

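But I suspect you have a factor variable in your data.frame. apply first converts the data frame to a matrix, and a matrix can hold only one type, so a single factor column turns every value into a character string. data.frame() then converts those character columns to factors (the default behaviour before R 4.0.0), which is why your numeric variables end up as factors too.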

Here is an example using a toy dataset:

set.seed(123)
toydat <- data.frame(A = 1:10, B = rnorm(10), C = LETTERS[1:10])
str(toydat)

## 'data.frame':    10 obs. of  3 variables:
##  $ A: int  1 2 3 4 5 6 7 8 9 10
##  $ B: num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
##  $ C: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10

set.seed(1)
str(data.frame(apply(toydat[,1:2], 2, sample, replace = TRUE)))

## 'data.frame':    10 obs. of  2 variables:
##  $ A: num  3 4 6 10 3 9 10 7 7 1
##  $ B: num  1.5587 -0.2302 0.4609 0.0705 -1.2651 ...

# with the factor column C     
set.seed(2)
str(data.frame(apply(toydat[,1:3], 2, sample, replace = TRUE)))

## 'data.frame':    10 obs. of  3 variables:
##  $ A: Factor w/ 6 levels "10"," 2"," 5",..: 2 5 4 2 1 1 2 6 3 4
##  $ B: Factor w/ 8 levels " 0.129288","-0.230177",..: 8 7 6 2 1 5 3 7 1 4
##  $ C: Factor w/ 6 levels "B","D","E","G",..: 4 2 5 1 2 3 1 2 6 1
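
To look at the intermediate step in isolation, the sketch below calls as.matrix() directly, which is the conversion apply performs on a data frame before it runs the function. The data frame df and its column names are assumptions made for illustration only.

```r
# Illustrative data frame: two numeric columns and one factor column
df <- data.frame(A = 1:3, B = c(0.5, 1.5, 2.5), C = factor(c("x", "y", "z")))

m_all <- as.matrix(df)           # includes the factor column
m_num <- as.matrix(df[, 1:2])    # numeric columns only

typeof(m_all)   # "character": the numbers were turned into strings
typeof(m_num)   # "double": no coercion to text without the factor
```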

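This is where the plyr package comes in handy, because it lets you control the output type (using the **ply functions). In this case, though, the colwise function is enough: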

require(plyr)
set.seed(2)
mysamplingfun <- colwise(function(x) sample(x, replace = TRUE))
str(mysamplingfun(toydat[,1:3]))

## 'data.frame':    10 obs. of  3 variables:
##  $ A: int  2 8 6 2 10 10 2 9 5 6
##  $ B: num  1.715 1.559 -1.265 -0.23 0.129 ...
##  $ C: Factor w/ 10 levels "A","B","C","D",..: 7 4 9 2 4 5 2 4 10 2
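
If you would rather not depend on plyr, the same column-by-column resampling can probably be done in base R with lapply, which works on one column at a time and so never builds a mixed-type matrix. The following is only a sketch of the whole procedure from the question under that assumption; resample_columns, original, synthetic and combined are hypothetical names, and rbind is used here in place of smartbind.

```r
# Hypothetical helper: resample every column independently, with replacement.
# Indexing with sample.int() avoids the surprise that sample(x) treats a
# single number x as the range 1:x.
resample_columns <- function(dat) {
  out <- lapply(dat, function(x) x[sample.int(length(x), replace = TRUE)])
  as.data.frame(out, stringsAsFactors = FALSE)
}

set.seed(42)
original <- data.frame(A = 1:10, B = rnorm(10), C = factor(LETTERS[1:10]))

synthetic <- resample_columns(original)   # same shape, same column types

original$class  <- 1   # label the real examples
synthetic$class <- 2   # label the synthetic examples
combined <- rbind(original, synthetic)

str(combined)
```

Because each column is sampled separately, integers stay integers, doubles stay doubles, and factors keep their full set of levels, so the two halves bind together without any type conversion.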