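Cleaning up factor levels (collapsing multiple levels/labels)

Asked: 2013-10-16 17:37:49

Tags: r factors r-faq

What is the most effective (i.e. efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how do you combine two or more factor levels into one?

Here is an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" should be collapsed to "No":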

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option, of course, is to clean up the strings beforehand using sub and friends.
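For illustration, a minimal sketch of that pre-cleaning route, assuming the only variants are exactly "Y" and "N" (the names cleaned and cleaned_f are made up for this example):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Anchor the patterns so only the exact strings "Y" and "N" are rewritten
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Values not listed in `levels` (here "H") become NA
cleaned_f <- factor(cleaned, levels = c("Yes", "No"))
cleaned_f
```

This works, but each new spelling variant needs another sub call, which is part of why a mapping-based approach is attractive.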

Another method is to allow duplicate labels and then drop them:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 

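However, is there a more effective way?

While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what would happen. Needless to say, none of the following got me any closer to my goal.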

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

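10 answers:

Answer 0 (score: 74)

Use the levels<- replacement function and pass it a named list, where the names are the desired level names and the elements are the current level names that should be renamed.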

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
levels(x) <- list(Yes=c("Y", "Yes"), No=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>
## Levels: Yes No

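This is described in the levels documentation; see also the examples there.

  value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.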

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here: https://stackoverflow.com/a/10491881/210673

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
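As a sanity check after collapsing, one can cross-tabulate the original levels against the new ones. This is only a sketch; the names f_orig and f_new are made up for the example:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
f_orig <- factor(x)

f_new <- f_orig
levels(f_new) <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

# Each original level should land in exactly one column;
# levels left out of the list (here "H") end up in the NA column
tab <- table(original = f_orig, collapsed = f_new, useNA = "ifany")
tab
```

Any original level that shows up in the NA column was not covered by the named list.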

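Answer 1 (score: 17)

As the question is titled "Cleaning up factor levels (collapsing multiple levels/labels)", the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

It provides several convenience functions for cleaning up factor levels: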

x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)

Collapse factor levels into manually defined groups

fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Change factor levels by hand

fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Automatically relabel factor levels, collapsing as necessary

fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

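Note that fct_relabel() works on factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is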

[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

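Here the levels are ordered as they appear in x, which differs from the default (see ?factor: the levels of a factor are sorted by default).

To match the expected output, use fct_inorder() before collapsing the levels: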

fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")

Both now return the expected output with the levels in the same order.
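For readers without forcats, the same first-appearance ordering can be sketched in base R. This is not part of forcats; the lookup vector key is a name made up for this example:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Hypothetical lookup: old level name -> new level name
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")

# levels = unique(x) keeps the levels in order of first appearance
f <- factor(x, levels = unique(x))

# Assigning a character vector with duplicates merges those levels;
# levels missing from `key` map to NA and are dropped
levels(f) <- unname(key[levels(f)])
f
```

Because the merged levels keep the position of their first occurrence, "Yes" comes before "No" here.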

Answer 2 (score: 7)

Perhaps a named vector used as a key might be of use:

> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes

This looks very similar to your last attempt... but this one works :-)
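The reason it works is that indexing a named vector by a character vector is a lookup by name, done before factor() is ever called. A slightly expanded sketch (key and looked_up are made-up names) shows that the H = NA entry is optional, since names that are not found already yield NA:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Hypothetical lookup table: names are the old values, elements the new ones
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")

# "H" is not among names(key), so the lookup returns NA for it
looked_up <- unname(key[x])

# Setting levels explicitly controls their order
f <- factor(looked_up, levels = c("Yes", "No"))
f
```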

Answer 3 (score: 5)

Another way is to build a table containing the mapping:

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

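I prefer this way, since it leaves behind an easily inspected object summarizing the map, and the data.table code looks just like any other join in that syntax.

Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":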

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
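One benefit of keeping the mapping table around is that it is easy to check which input values it does not cover. A small base R sketch (unmapped is a made-up name):

```r
fmap <- stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Values in x that have no row in the mapping table will become NA
unmapped <- setdiff(unique(x), fmap$values)
unmapped
# [1] "H"
```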

Answer 4 (score: 3)

Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line of code:

x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

"1 line, maps multiple values to the same level, sets NA for missing levels" (h/t @Aaron)
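To avoid typing the levels and labels in two parallel vectors, both can be derived from a single named list. This is only a sketch built on the call above; groups is a made-up name, and it relies on an R version that accepts duplicated labels:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Hypothetical grouping: name = new level, elements = old values
groups <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

f <- factor(
  x,
  levels = unlist(groups, use.names = FALSE),   # "Y" "Yes" "N" "No"
  labels = rep(names(groups), lengths(groups))  # "Yes" "Yes" "No" "No"
)
f
```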

Answer 5 (score: 2)

I don't know your real use case, but strtrim would be useful here...

factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: Yes No
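A caveat worth illustrating: matching on the first character treats anything starting with that letter as the same answer. The sketch below uses made-up input (x2) and adds toupper() so that lowercase variants are caught as well:

```r
# Made-up input with lowercase variants and an unrelated word
x2 <- c("y", "yes", "n", "NO", "Yellow", "H")

first_char <- toupper(strtrim(x2, 1))   # "Y" "Y" "N" "N" "Y" "H"
f2 <- factor(first_char, levels = c("Y", "N"), labels = c("Yes", "No"))
f2
```

Note that "Yellow" is mapped to Yes here, so this shortcut is only safe when the set of possible inputs is known.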

Answer 6 (score: 2)

Similar to @Aaron's approach, but slightly simpler:

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)  
# [1] "H"   "N"   "No"  "Y"   "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
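Hard-coded positions depend on how the levels happen to be sorted, which can vary with locale. A hedged variant of the same idea looks the positions up by name with match() instead (idx is a made-up name):

```r
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Find the positions of the offending levels by name rather than by number
idx <- match(c("H", "N", "Y"), levels(x))
levels(x)[idx] <- c(NA, "No", "Yes")
x
```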

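Answer 7 (score: 2)

I'm adding this answer to show the accepted answer working on a specific factor in a data frame, since this was not initially obvious to me (though it probably should have been).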

levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
#    0    1    Z 
# 7012 2507    8 
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
#    0    1 
# 7020 2507
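Since df above is not shown, here is a self-contained version with a small made-up data frame (the values are invented purely for illustration):

```r
# Made-up data: six rows, one stray "Z" that should count as "0"
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "1", "0")))

levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")

counts <- table(df$var1)
counts
```

The assignment modifies the column in place, so no temporary copy of the factor is needed.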

Answer 8 (score: 1)

You can use the function below to combine/collapse multiple factor levels:

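combofactor <- function(pattern_vector,
                        replacement_vector,
                        data) {
  levels <- levels(data)
  # replace each matching level name with its replacement
  for (i in seq_along(pattern_vector)) {
    levels[which(pattern_vector[i] == levels)] <- replacement_vector[i]
  }
  # assigning duplicated names merges the corresponding levels
  levels(data) <- levels
  data
}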

Example:

Initialize x

x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))

Check the structure

str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...

Use the function:

x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)

Recheck the structure:

str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...

Answer 9 (score: 1)

First, note that in this specific case we can use partial matching:

x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

In the more general case I would go with dplyr::recode:

library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

If the starting point is a factor, it changes slightly:

x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No