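Min-max scaling/normalization in R for train and test data

Time: 2017-05-18 14:05:07

Tags: r

I would like to create a function that takes a training set and a test set as arguments, min-max scales/normalizes the training set and returns it, and then uses the same minimum values and ranges to min-max scale/normalize the test set and return that as well.

This is the function I have come up with so far: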

min_max_scaling <- function(train, test){

  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))

  # scale the training data

  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))

  for(i in seq_len(ncol(train))){
    column <- (train[,i] - min_vals[i])/range1[i]
    train_scaled[i] <- column
  }

  colnames(train_scaled) <- colnames(train)

  # scale the testing data using the min and range of the train data

  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))

  for(i in seq_len(ncol(test))){
    column <- (test[,i] - min_vals[i])/range1[i]
    test_scaled[i] <- column
  }

  colnames(test_scaled) <- colnames(test)

  return(list(train = train_scaled, test = test_scaled))
}

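The definition of min-max scaling is similar to the one in a question previously asked on SO: Normalisation of a two column data using min and max values

My questions are:
 1. Is there a way to vectorize the two for loops in the function, for example using sapply()?
 2. Are there any packages that already do what I want?

3 answers:

Answer 0: (score: 5)

Regarding the second question, you can use the caret package: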

library(caret)

train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)

pp = preProcess(train, method = "range")


predict(pp, train)

#     a   b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0

predict(pp, test)

#     a    b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5  0.0
# 5 2.0  0.5
# 6 2.5  1.0

This package also defines other transformation methods; see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
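For the first question, the two loops can also be avoided in base R. The sketch below is one possible approach, not part of the original answer: it passes the training minima and ranges to `scale()` as `center` and `scale`, so both sets are transformed with the same parameters. The helper name `min_max_scale` is illustrative, and the code assumes all columns are numeric and that `test` has the same columns as `train`.

```r
# Hypothetical helper (name is illustrative): min-max scale `train`, then
# apply the same training minima and ranges to `test`.
min_max_scale <- function(train, test) {
  min_vals <- sapply(train, min)
  ranges   <- sapply(train, function(x) diff(range(x)))

  # scale() subtracts `center` from each column and divides by `scale`
  rescale <- function(d) {
    as.data.frame(scale(d, center = min_vals, scale = ranges))
  }

  list(train = rescale(train), test = rescale(test))
}

train <- data.frame(a = 1:3, b = 10:12)
test  <- data.frame(a = 1:6, b = 7:12)

res <- min_max_scale(train, test)
res$train  # every column runs from 0 to 1
res$test   # test values outside the training range fall outside [0, 1]
```

On the example data above this should give the same numbers as the `predict(pp, ...)` calls, since both use the training minimum and range for each column.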

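Answer 1: (score: 2)

normalize <- function(x)
{
    return((x - min(x)) / (max(x) - min(x)))
}

# lapply() runs normalize() on every column and returns a list,
# so wrap the result in as.data.frame() to get a data frame back
as.data.frame(lapply(df, normalize))

For min-max normalization, try this; it may work. Note that normalize() uses the minimum and maximum of whatever vector it is given, so calling it on the test set separately does not reuse the training set's values.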
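One caveat with a function of this shape: for a constant column, `max(x) - min(x)` is zero, so the division yields `NaN`. A guarded variant could look like the sketch below. The name `normalize_safe` and the choice to map constant columns to 0 are assumptions made for illustration, and missing values are not handled.

```r
# Hypothetical guarded variant (name and the "constant column -> 0" policy
# are illustrative assumptions, not from the original answer).
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) {
    # A constant column has no spread to rescale; return zeros instead of NaN
    return(rep(0, length(x)))
  }
  (x - min(x)) / rng
}

df <- data.frame(a = c(2, 4, 6), b = c(5, 5, 5))
out <- as.data.frame(lapply(df, normalize_safe))
out
```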

Answer 2: (score: 0)

set.seed(1984)

### simulating a data set

df <- data.frame(var1 = rnorm(100, 5, 3),
                 var2 = rpois(100, 15),
                 var3 = runif(50, 90, 100))  # 50 values, recycled by data.frame() to 100 rows

df_train <- df[1:60,]
df_test <- df[61:100,]



## the function

normalize_data <- function(train_set, test_set)  ## the args are the two sets
{
  mins   <- sapply(train_set, min)                          ## per-column minimum of the training set
  ranges <- sapply(train_set, function(x) max(x) - min(x))  ## per-column range of the training set

  ## Map() pairs each column with its own minimum and range.
  ## Dividing a data frame by a vector directly (train_set / ranges)
  ## recycles the vector down the rows rather than across the columns.
  rescale <- function(d) {
    as.data.frame(Map(function(x, m, r) (x - m) / r, d, mins, ranges))
  }

  normalized_train <- rescale(train_set)   # the normalization
  normalized_test  <- rescale(test_set)    # same training parameters

  return(list(ranges = ranges,                    # returning a list
              normalized_train = normalized_train,
              normalized_test = normalized_test))
}


z <- normalize_data(df_train, df_test)   ## applying the function

    ## the results
    > z$ranges
         var1      var2      var3 
    13.051448 22.000000  9.945934 
    > head(z$normalized_train)
    > head(z$normalized_test)