My data frame my.data contains both numeric and factor variables. I want to standardize the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Can the standardization be done like this? I want to standardize columns 8, 9, 10, 11 and 12, but I think my code is wrong.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance.
Answer 0: (score: 8)
Here is one option for standardizing:
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
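The question asks specifically about columns 8 to 12. A minimal sketch of selecting columns by position instead of by type is shown below. The data frame here is a hypothetical stand-in for flowdis3 (its real contents are not shown in the question), so the positions 3:5 play the role of 8:12.

```r
# Hypothetical stand-in for the asker's flowdis3: two factor columns, three numeric ones
flowdis3 <- data.frame(
  site  = factor(c("a", "b", "c", "d")),
  group = factor(c("x", "y", "x", "y")),
  v1    = c(1, 2, 3, 4),
  v2    = c(10, 20, 30, 50),
  v3    = c(0.5, 0.1, 0.9, 0.3)
)

cols <- 3:5  # column positions to standardize (8:12 in the question)

mydata <- flowdis3
# scale() returns a matrix; assigning it into the selected columns
# leaves the remaining columns untouched
mydata[cols] <- scale(flowdis3[cols], center = TRUE, scale = TRUE)

round(colMeans(mydata[cols]), 10)  # means of the standardized columns
sapply(mydata[cols], sd)           # standard deviations of the standardized columns
```

Note that the call in the question also ends with a trailing comma before the closing parenthesis, which scale() rejects as an unused argument.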
Answer 1: (score: 0)
You can do this with the dplyr package:
library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)
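In dplyr 1.0.0 and later, the scoped verbs such as mutate_if() are superseded by across(). A sketch of the equivalent call follows, assuming dplyr >= 1.0.0 is installed. The small data frame is made up for illustration, and as.vector() is added so that the scaled columns stay plain numeric vectors.

```r
library(dplyr)

# Made-up example data: one factor column and two numeric columns
mydata2 <- data.frame(
  id = factor(c("a", "b", "c", "d")),
  x  = c(1, 2, 3, 4),
  y  = c(10, 20, 30, 50)
)

# Standardize every numeric column; non-numeric columns pass through unchanged
scaled <- mydata2 %>%
  mutate(across(where(is.numeric), ~ as.vector(scale(.x))))

str(scaled)
```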
Answer 2: (score: 0)
Although this answer is late, here are some options to consider:
# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
memory.limit(size = 64935)  # Windows only; no longer supported in R >= 4.2
# Set working directory
setwd("path")
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
"Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
"Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
"Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
"Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
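"Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
stringsAsFactors = TRUE)  # needed in R >= 4.0 to get the factor columns shown below
Let's check the structure of df: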
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
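We see that Age, Salary, Height and Weight are numeric, while Name and Gender are categorical (factor variables).
Let's scale only the numeric variables using base R:
Option 1 (a slight modification of what akrun proposed here):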
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
Option 2 (akrun's approach):
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
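The two timings above each come from a single run on ten rows, so the difference between them is mostly noise. A rough sketch of a more stable comparison is to repeat each variant many times on a larger vector. The vector size, the repetition count and the helper names manual and builtin are arbitrary choices made for this illustration.

```r
set.seed(1)
x <- rnorm(1e5)  # larger synthetic input so the work is measurable

manual  <- function(v) (v - mean(v)) / sd(v)  # arithmetic used in Option 1
builtin <- function(v) as.vector(scale(v))    # scale() as used in Option 2

# Repeat each variant 50 times and record the elapsed time
t_manual  <- system.time(for (i in 1:50) manual(x))
t_builtin <- system.time(for (i in 1:50) builtin(x))

t_manual["elapsed"]
t_builtin["elapsed"]

# Both variants should agree on the result
all.equal(manual(x), builtin(x))
```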
Option 3:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
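The "scaled:center" and "scaled:scale" attributes in this output hold the mean and standard deviation that scale() used. One thing they make possible is undoing the standardization later. A small sketch using the Age values from the example (the variable names are chosen here for illustration):

```r
x <- c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39)  # the Age column from the example
z <- scale(x)

center <- attr(z, "scaled:center")  # mean of x
spread <- attr(z, "scaled:scale")   # standard deviation of x

# Undo the standardization to recover the original values
x_back <- as.vector(z) * spread + center
all.equal(x_back, x)
```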
Option 4 (using the tidyverse, calling dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
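Which option to choose depends on the type of output you need and on speed. If your data is imbalanced, you want to balance it, and you plan to run a classification after scaling the numeric variables, then the matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. That is: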
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
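For example, the ROSE package (used for balancing data) does not accept data structures other than int, factor and num, so it will throw an error.
To avoid this problem, the scaled numeric variables can be stored as vectors rather than column matrices, as follows: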
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, ~scale (.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4
which gives:
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
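The same fix can be sketched in base R, with is.matrix() as a quick check of whether a column is still a one-column matrix before it is passed to another package. The reduced data frame and the names as_matrix_cols and as_vector_cols are made up for this illustration.

```r
# Reduced version of the example data
df <- data.frame(
  Age    = c(21, 19, 25, 34, 45),
  Gender = factor(c("Female", "Female", "Male", "Female", "Male"))
)

num <- sapply(df, is.numeric)  # logical index of the numeric columns

# As in Option 3: scale() leaves one-column matrices in the data frame
as_matrix_cols <- df
as_matrix_cols[num] <- lapply(df[num], scale)

# Wrapping in as.vector() drops the matrix dimensions and the attributes
as_vector_cols <- df
as_vector_cols[num] <- lapply(df[num], function(x) as.vector(scale(x)))

is.matrix(as_matrix_cols$Age)  # TRUE
is.matrix(as_vector_cols$Age)  # FALSE
```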