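I have written my own function to build a kNN model.
It works for numeric data.
My question is: how do I prepare categorical and mixed data for kNN in R?
Below are samples of the two kinds of data I have encountered.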
1- Mixed data
Some rows and columns of the data:
V1 V2 V3 V4 V5 V6
1 39 State-gov 77516 Bachelors 13 Never-married
2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
3 38 Private 215646 HS-grad 9 Divorced
4 53 Private 234721 11th 7 Married-civ-spouse
5 28 Private 338409 Bachelors 13 Married-civ-spouse
6 37 Private 284582 Masters 14 Married-civ-spouse
7 49 Private 160187 9th 5 Married-spouse-absent
8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse
9 31 Private 45781 Masters 14 Never-married
10 42 Private 159449 Bachelors 13 Married-civ-spouse
11 37 Private 280464 Some-college 10 Married-civ-spouse
12 30 State-gov 141297 Bachelors 13 Married-civ-spouse
2- Categorical data
Some rows and columns of the data:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 p x s n t p f c n k e e s s w w p w o p k s u
2 e x s y t a f c b k e c s s w w p w o p n n g
3 e b s w t l f c b n e c s s w w p w o p n n m
4 p x y w t p f c n n e e s s w w p w o p k s u
5 e x s g f n f w b k t e s s w w p w o e n a g
6 e x y y t a f c b n e c s s w w p w o p k n g
7 e b s w t a f c b g e c s s w w p w o p k n m
8 e b y w t l f c b n e c s s w w p w o p n s m
9 p x y w t p f c n p e e s s w w p w o p k v g
10 e b s y t a f c b g e c s s w w p w o p k s m
11 e x y y t l f c b g e c s s w w p w o p n n g
Answer 0 (score: 1)
An example with one column (df is your mixed data):
library(CatEncoders)
test <- df$V4 # select one column
lenc <- LabelEncoder.fit(test)
print(lenc)
# An object of class "LabelEncoder.Factor"
# Slot "classes":
# [1] 11th 9th Bachelors HS-grad Masters
# [6] Some-college
# Levels: 11th 9th Bachelors HS-grad Masters Some-college
#
# Slot "type":
# [1] "factor"
#
# Slot "mapping":
# classes ind
# 1 11th 1
# 2 9th 2
# 3 Bachelors 3
# 4 HS-grad 4
# 5 Masters 5
# 6 Some-college 6
transformed_test <- transform(lenc, test)
print(transformed_test)
# [1] 3 3 4 1 3 5 2 4 5 3 6 3
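To see what the encoder is doing, the same codes can be reproduced in base R. This is only a sketch, not how CatEncoders is implemented: it assumes the column holds the twelve V4 values from the sample, and the names education, edu_factor and edu_codes are made up for illustration.

```r
# Hypothetical stand-in for df$V4, copied from the twelve sample rows
education <- c("Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors",
               "Masters", "9th", "HS-grad", "Masters", "Bachelors",
               "Some-college", "Bachelors")

# factor() sorts the distinct values into levels;
# as.integer() then returns each value's position among those levels
edu_factor <- factor(education)
edu_codes <- as.integer(edu_factor)

print(levels(edu_factor))
print(edu_codes)
```

For this column the integer codes should match the transform() output shown above, because both assign 1..n to the sorted distinct values.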
Update
Use the sapply function to transform all columns in the data frame:
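# Label-encode one column; numeric columns are returned unchanged.
# Named encode_column instead of t so it does not mask base::t().
encode_column <- function(x) {
  # check if x is numeric
  if (is.numeric(x)) {
    return(x)
  }
  enc <- LabelEncoder.fit(x)
  transform(enc, x)
}
# sapply() returns a numeric matrix, so convert it back to a data frame
new_df <- as.data.frame(sapply(df, encode_column))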
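One caveat for kNN specifically: integer labels impose an order and spacing on the categories (for example, code 5 is "farther" from code 1 than code 2 is), and unscaled numeric columns with large ranges can dominate the distance. A common alternative is one-hot encoding plus standardization. The following is a minimal base R sketch under assumptions: the small data frame mixed is a hypothetical subset of the sample above (first five rows, columns V1, V2, V5), and all variable names are illustrative.

```r
# Hypothetical subset of the mixed sample: first five rows, columns V1, V2, V5
mixed <- data.frame(
  V1 = c(39, 50, 38, 53, 28),
  V2 = c("State-gov", "Self-emp-not-inc", "Private", "Private", "Private"),
  V5 = c(13, 13, 9, 7, 13),
  stringsAsFactors = TRUE
)

# One 0/1 indicator column per level of V2; "- 1" drops the intercept
# so that every level gets its own column
dummies <- model.matrix(~ V2 - 1, data = mixed)

# Standardize the numeric columns to mean 0 and standard deviation 1
numeric_part <- scale(mixed[, c("V1", "V5")])

# Combine into a single numeric feature matrix for a distance-based model
features <- cbind(numeric_part, dummies)
print(colnames(features))
print(dim(features))
```

Whether one-hot or label encoding works better depends on the data; for the all-categorical sample, a matching-based distance is another option, which this sketch does not cover.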