LabelEncoder functionality in R?

Time: 2016-07-27 18:18:02

Tags: r

In Python, scikit-learn has a great function called LabelEncoder that maps categorical levels (strings) to an integer representation.

Is there anything in R that does this? For example, if there is a variable named color with values {'Blue','Red','Green'}, the encoder would translate:

Blue => 1
Green => 2
Red => 3

and create an object holding this mapping, which can then be used to transform new data in the same way.

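Added: Factors alone do not seem to work, because there is no persistent mapping. If the new data contains levels that were not seen in the training data, the whole structure changes. Ideally, I would like new levels to be marked as missing or as "other" in some way.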

sample_dat <- data.frame(a_str=c('Red','Blue','Blue','Red','Green'))
sample_dat$a_int<-as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
#[1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str=c('Red','Blue','Blue','Red','Green','Azure'))
sample_dat2$a_int<-as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1

9 answers:

Answer 0 (score: 8)

Create a data vector:

colors <- c("red", "red", "blue", "green")

Create a factor:

factors <- factor(colors)

Convert the factor to numbers:

as.numeric(factors)

Output (note that the codes follow alphabetical order):

# [1] 3 3 1 2

You can also set a custom numbering system (note that the output now follows the "rainbow color order" that I defined):

rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4

See ?factor
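To apply the same mapping to new data, as the question asks, one option is to keep the levels of the training factor and reuse them. This is only a minimal base R sketch; the names train_levels, new_colors and other_code, and the idea of an extra "other" code, are illustrative assumptions rather than part of the answer above.

```r
# Training data and its levels (this vector is the persistent mapping)
colors <- c("red", "red", "blue", "green")
train_levels <- levels(factor(colors))   # "blue" "green" "red"

# New data containing a value that was not seen in training
new_colors <- c("red", "blue", "azure")

# Reusing the stored levels keeps the codes stable; unseen values become NA
encoded <- as.integer(factor(new_colors, levels = train_levels))
encoded
# [1]  3  1 NA

# Optionally send unseen values to an explicit "other" code instead of NA
# (note: genuine NA inputs would land in the same bucket)
other_code <- length(train_levels) + 1L
encoded_other <- ifelse(is.na(encoded), other_code, encoded)
encoded_other
# [1] 3 1 4
```

Saving train_levels (for example with saveRDS) is enough to reproduce the encoding later.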

Answer 1 (score: 7)

If I understand correctly what you want:

# function which returns function which will encode vectors with values  of 'vec' 
label_encoder = function(vec){
    levels = sort(unique(vec))
    function(x){
        match(x, levels)
    }
}

colors = c("red", "red", "blue", "green")

color_encoder = label_encoder(colors) # create encoder

encoded_colors = color_encoder(colors) # encode colors
encoded_colors

new_colors = c("blue", "green", "green")  # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors

other_colors = c("blue", "green", "green", "yellow") 
color_encoder(other_colors) # NA's are introduced

# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result

# dealing with multiple columns

# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
    matrix(
        sample(c("red", "blue", "green",  "yellow"), 12, replace = TRUE),
        ncol = 3)
)
color_dataframe

# encode each column
for (column in colnames(color_dataframe)){
    color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
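The closure approach above can be extended to cover two points from the question: decoding codes back to labels, and giving unseen values a dedicated code instead of NA. The following is a sketch only; label_codec and its unknown argument are hypothetical names introduced here, not part of the answer.

```r
# Hypothetical variant of label_encoder(): returns an encoder and a decoder
label_codec <- function(vec, unknown = NA_integer_) {
  lv <- sort(unique(vec))
  list(
    encode = function(x) {
      codes <- match(x, lv)
      codes[is.na(codes)] <- unknown   # unseen (or NA) values get the `unknown` code
      codes
    },
    decode = function(codes) {
      # any code outside 1..length(lv) has no label and decodes to NA
      codes[!(codes %in% seq_along(lv))] <- NA
      lv[codes]
    }
  )
}

codec <- label_codec(c("red", "red", "blue", "green"), unknown = 0L)
codec$encode(c("blue", "green", "yellow"))
# [1] 1 2 0
codec$decode(c(3L, 1L))
# [1] "red"  "blue"
```

Like the original encoder, the returned list can be saved with saveRDS and restored with readRDS.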

Answer 2 (score: 2)

Try the CatEncoders package. It replicates the functionality of Python's sklearn.preprocessing.

library(CatEncoders)

# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)

# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values

# [1]  3  3 NA


# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA   

I will add a feature that reports unmatched labels with a warning.

PS: It also has a OneHotEncoder function.

Answer 3 (score: 1)

I wrote the following, which I believe works; its efficiency and how well it scales have not been tested.

str2Int.fit_transform<-function(df, plug_missing=TRUE){

  list_of_levels=list()  #empty list   

  #loop through the columns
  for (i in 1: ncol(df))
  {

    #only character or factor columns
    if (is.character(df[,i]) ||  is.factor(df[,i]) ){

      #deal with missing
      if(plug_missing){

        #if factor
        if (is.factor(df[,i])){
          df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
          df[,i][is.na(df[,i])] = 'MISSING' 


        }else{   #if character

          df[,i][is.na(df[,i])] = 'MISSING' 

        }
      }#end missing IF

      levels<-unique(df[,i]) #distinct levels
      list_of_levels[[colnames(df)[i]]] <- levels #set list with name of the columns to the levels
      df[,i] <- as.numeric(factor(df[,i], levels = levels))

    }#end if character/factor IF


  }#end loop

  return (list(list_of_levels,df)) #return the list of levels and the new DF

}#end of function



str2Int.transform<-function(df,list_of_levels,plug_missing=TRUE)
{
  #loop through the columns
  for (i in 1: ncol(df))
  {

    #only character or factor columns
    if (is.character(df[,i]) ||  is.factor(df[,i]) ){


      #deal with missing
      if(plug_missing){

        #if factor
        if (is.factor(df[,i])){
          df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
          df[,i][is.na(df[,i])] = 'MISSING' 


        }else{   #if character

          df[,i][is.na(df[,i])] = 'MISSING' 

        }
      }#end missing IF

      levels=list_of_levels[[colnames(df)[i]]]

      if (! is.null(levels)){
        df[,i] <- as.numeric(factor(df[,i], levels = levels))
      }

    }# character or factor

  }#end of loop

  return(df)

}#end of function




######################################################
# Test the functions
######################################################



###Test fit transform

# as strings
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=FALSE)

result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df

#as factors
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=TRUE)

result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df



###Test transform
str2Int.transform(sample_dat,result[[1]])

Answer 4 (score: 1)

It is hard to believe that nobody has mentioned caret's dummyVars function.

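This is a widely searched question, and people do not want to write their own method or copy and paste another user's; they want a package, and caret is the closest thing to sklearn in R.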

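Edit: I now realize that what the user actually wants is to convert strings into consecutive integers, which is what sklearn's LabelEncoder does. I am leaving this answer here anyway, because one-hot encoding is the more accurate way to encode categorical data.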
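For readers who want to see what one-hot encoding looks like, here is a small dependency-free sketch. Note that it uses base R's model.matrix rather than caret's dummyVars, so it illustrates the idea and not the caret API; the data frame and column name are made up for the example.

```r
# One-hot (dummy) encoding with base R's model.matrix()
colors <- data.frame(color = factor(c("red", "red", "blue", "green")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
one_hot <- model.matrix(~ color - 1, data = colors)

colnames(one_hot)   # one column per level: colorblue, colorgreen, colorred
one_hot[, "colorred"]   # 1 for the rows where color == "red", 0 otherwise
```

Each row contains exactly one 1, in the column that corresponds to its level.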

Answer 5 (score: 0)

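# input P to the function below is a data frame containing only categorical variables
numlevel <- function(P) {
  for (i in seq_along(P)) {
    # coerce to factor (drops unused levels, also handles character columns)
    P[[i]] <- factor(P[[i]])
    # relabel the levels as 1..m; the column stays a factor
    levels(P[[i]]) <- seq_len(nlevels(P[[i]]))
  }
  return(P)
}

# P is your own data frame of categorical variables
Q <- numlevel(P)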

Answer 6 (score: 0)

df<- mtcars

head(df)

df$cyl  <- factor(df$cyl)

df$carb <- factor(df$carb)
vec <- sapply(df, is.factor)

catlevels <- sapply(df[vec], levels)

#store the levels for each category
#level appearing first is coded as 1, second as 2 so on

df <- sapply(df, as.numeric)

class(df) #matrix

df <- data.frame(df) 

#converting back to dataframe

head(df)
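The stored levels are what make the encoding reversible. As a rough sketch for a single column (cyl_levels, codes and decoded are names chosen here for illustration), the integer codes can be mapped back to the original labels by indexing into the saved levels:

```r
df <- mtcars
df$cyl <- factor(df$cyl)

# Keep the levels before converting; they are the code -> label mapping
cyl_levels <- levels(df$cyl)      # "4" "6" "8"

codes <- as.integer(df$cyl)       # 1, 2 or 3
decoded <- cyl_levels[codes]      # back to "4", "6" or "8"

all(decoded == as.character(mtcars$cyl))
# [1] TRUE
```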

Answer 7 (score: 0)

# Data
Country <- c("France", "Spain", "Germany", "Spain", "Germany", "France")
Age <- c(34, 27, 30, 32, 42, 30)
Purchased <- c("No", "Yes", "No", "No", "Yes", "Yes")
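# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default (the is.factor check below relies on it)
df <- data.frame(Country, Age, Purchased, stringsAsFactors = TRUE)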
df

# Output
  Country Age Purchased
1  France  34        No
2   Spain  27       Yes
3 Germany  30        No
4   Spain  32        No
5 Germany  42       Yes
6  France  30       Yes

Using the CatEncoders package: encoders for categorical variables

library(CatEncoders)

# Saving names of categorical variables
factors <- names(which(sapply(df, is.factor)))

# Label Encoder
for (i in factors){
  encode <- LabelEncoder.fit(df[, i])
  df[, i] <- transform(encode, df[, i])
}
df

# Output
  Country Age Purchased
1       1  34         1
2       3  27         2
3       2  30         1
4       3  32         1
5       2  42         2
6       1  30         2

Using base R: the factor function

# Label Encoder (run this on the original df, not on the already-encoded one)
levels <- c("France", "Spain", "Germany", "No", "Yes")
labels <- c(1, 2, 3, 1, 2)
for (i in factors){
  df[, i] <- factor(df[, i], levels = levels, labels = labels, ordered = TRUE)
}
df

# Output
  Country Age Purchased
1       1  34         1
2       2  27         2
3       3  30         1
4       2  32         1
5       3  42         2
6       1  30         2

Answer 8 (score: 0)

Here is a simple, tidy solution:

The superml package (https://www.rdocumentation.org/packages/superml/versions/0.5.3) has a LabelEncoder class: https://www.rdocumentation.org/packages/superml/versions/0.5.3/topics/LabelEncoder

install.packages("superml")
library(superml)

lbl <- LabelEncoder$new()
lbl$fit(sample_dat$column)
sample_dat$column <- lbl$fit_transform(sample_dat$column)
decode_names <- lbl$inverse_transform(sample_dat$column)