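I want to create a function that takes a training set and a test set as arguments, min-max scales (normalizes) the training set and returns it, and then uses the same minimums and ranges to min-max scale the test set and return it as well.
This is the function I have come up with so far: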
min_max_scaling <- function(train, test){
  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))
  # scale the training data
  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
  for(i in seq_len(ncol(train))){
    column <- (train[,i] - min_vals[i])/range1[i]
    train_scaled[i] <- column
  }
  colnames(train_scaled) <- colnames(train)
  # scale the testing data using the min and range of the train data
  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
  for(i in seq_len(ncol(test))){
    column <- (test[,i] - min_vals[i])/range1[i]
    test_scaled[i] <- column
  }
  colnames(test_scaled) <- colnames(test)
  return(list(train = train_scaled, test = test_scaled))
}
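The definition of min-max scaling is the same as in this earlier question on SO: Normalisation of a two column data using min and max values
My questions are:
1. Is there a way to vectorize the two for loops in the function, e.g. using sapply()?
2. Are there any packages that already do what we want here?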
Answer 0 (score: 5)
Regarding the second question, you can use the caret package:
library(caret)
train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)
pp = preProcess(train, method = "range")
predict(pp, train)
# a b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
predict(pp, test)
# a b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5 0.0
# 5 2.0 0.5
# 6 2.5 1.0
This package also defines other transformation methods; see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
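For the first question (vectorizing the two loops), here is a minimal base R sketch that should give the same numbers as the caret example above. The helper name min_max_scale is hypothetical and not from the original post, and the sketch assumes every column is numeric and that the test set has the same columns in the same order as the training set.

```r
# Hypothetical helper (not from the original post): scale both sets with the
# training minimum and range, without explicit for loops.
min_max_scale <- function(train, test) {
  min_vals <- sapply(train, min)                         # per-column training minimum
  ranges   <- sapply(train, function(x) diff(range(x)))  # per-column training range
  # scale() subtracts `center` and then divides by `scale`, column by column
  list(
    train = as.data.frame(scale(train, center = min_vals, scale = ranges)),
    test  = as.data.frame(scale(test,  center = min_vals, scale = ranges))
  )
}

# Same toy data as in the caret example
train <- data.frame(a = 1:3, b = 10:12)
test  <- data.frame(a = 1:6, b = 7:12)
res <- min_max_scale(train, test)
res$train
res$test
```

A column that is constant in the training set has a range of zero, so this sketch would divide by zero for it; such columns would need separate handling.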
Answer 1 (score: 2)
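normalize <- function(x)
{
  return((x - min(x)) / (max(x) - min(x)))
}
# Apply normalize() to every column of a numeric data frame df;
# lapply returns a list, which as.data.frame turns back into a data frame
as.data.frame(lapply(df, normalize))
This performs min-max normalization and may work for you. Note that it scales each column with that column's own minimum and maximum, so on its own it does not reuse the training statistics for the test set.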
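To reuse the training statistics with a function of this shape, one option is to pass the reference column in as a second argument. The sketch below is only an illustration: normalize_with and the toy data are hypothetical, and it assumes the test columns line up with the training columns by position.

```r
# Hypothetical variant of normalize(): `ref` supplies the minimum and maximum
normalize_with <- function(x, ref) {
  (x - min(ref)) / (max(ref) - min(ref))
}

# Toy data, made up for illustration
train <- data.frame(a = c(2, 4, 6), b = c(10, 20, 30))
test  <- data.frame(a = c(3, 8), b = c(10, 40))

# Map() pairs each column with the training column in the same position
train_scaled <- as.data.frame(Map(normalize_with, train, train))
test_scaled  <- as.data.frame(Map(normalize_with, test, train))
test_scaled
```

Test values outside the training range come out below 0 or above 1, which is expected when the test set is scaled with training statistics.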
Answer 2 (score: 0)
set.seed(1984)
### simulating a data set
df <- data.frame(var1 = rnorm(100,5,3),
var2 = rpois(100,15),
var3 = runif(50,90,100))
df_train <- df[1:60,]
df_test <- df[61:100,]
## the function
normalize_data <- function(train_set, test_set) ## the args are the two sets
{
  mins <- sapply(train_set, min) ## column minimums of the training set
  ranges <- sapply(train_set, function(x) max(x)-min(x)) ## range calculation
  ## sweep() applies each statistic to its own column
  normalized_train <- sweep(sweep(train_set, 2, mins, "-"), 2, ranges, "/") # the normalization
  normalized_test <- sweep(sweep(test_set, 2, mins, "-"), 2, ranges, "/")
  return(list(ranges = ranges, # returning a list
              normalized_train = normalized_train,
              normalized_test = normalized_test))
}
z <- normalize_data(df_train, df_test) ## applying the function
## the results
z$ranges
     var1      var2      var3
13.051448 22.000000  9.945934
head(z$normalized_train) ## every training value lies in [0, 1]
head(z$normalized_test) ## test values can fall outside [0, 1]