Custom factor levels in concatenated strings

Time: 2019-01-06 18:13:57

Tags: r refactoring

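I have a factor variable whose values consist of two substrings separated by _, e.g. string1_string2. I want to set the factor levels of the prefix ("string1") and the suffix ("string2") separately, and then derive the overall set of factor levels for the concatenated string. In addition, which side takes precedence, the levels of the first substring or those of the second, may vary.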


A small example of what I want to achieve:

# reproducible data

x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C"))

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: COND_A COND_B COND_C DBO_A DBO_B DBO_C PH_A PH_B PH_C

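If I do not define the factor levels, they are sorted alphabetically. Now I want to set the levels of the strings on the left-hand and right-hand side of the _ separator, for example

  1. PH < COND < DBO on the left-hand side (LHS)
  2. B < A < C on the right-hand side (RHS).

In addition, I want to specify which side, LHS or RHS, takes precedence over the other. Depending on the precedence, the overall order of the levels will differ:

(1) If the LHS levels take precedence: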

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

(2) If the RHS levels take precedence:

[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

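Right now the only solution I can think of is something like factor(x, levels = c(xx, xx, ...)), but I have many more levels than shown above, so typing them all out seems absurd.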

Note: I do not want to change the order of the data, only the order of the levels.
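To make the note above concrete, here is a minimal sketch showing that re-levelling leaves the values where they are and rearranges only levels(). The level order is typed out by hand purely for illustration (which is exactly what the question wants to avoid), and the names lv and x2 are made up for this example.

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))

# Hand-written target order (LHS precedence), for illustration only
lv <- c("PH_B", "PH_A", "PH_C", "COND_B", "COND_A", "COND_C",
        "DBO_B", "DBO_A", "DBO_C")
x2 <- factor(x, levels = lv)

# The data are still in the original order ...
identical(as.character(x), as.character(x2))

# ... only the level set has been rearranged
levels(x2)
```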

4 answers:

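Answer 0: (score: 2)

With the CRAN package forcats you can concatenate a list of factors. The function below takes two vectors, prefix and suffix, each given in the order you want.
The argument sep = "_" defaults to the separator used in the question. You can pass a different separator if needed.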

library(forcats)

custom_fct <- function(prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    f <- paste(p, suffix, sep = sep)
    factor(f, levels = f)
  })
  fct_c(!!!lst)
}

x <- c("PH", "COND", "DBO")
y <- c("B", "A", "C")

custom_fct(x, y)

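Edit.

After the OP's comment I understood that another way to solve the problem is to coerce the input data vector x to a factor, given two vectors: one of prefixes and one of suffixes. The following function creates such a factor and needs no external package.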

custom_fct2 <- function(x, prefix, suffix, sep = "_"){
  lst <- lapply(prefix, function(p){
    paste(p, suffix, sep = sep)
  })
  factor(x, levels = unlist(lst))
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

custom_fct2(x, a, b)
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C  
#[9] COND_C
#9 Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B ... DBO_C
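custom_fct2 as written always gives the prefix precedence. The sketch below is one possible way to let the caller choose which side wins, using outer() to build the grid of level names. The helper make_levels and its suffix_first argument are hypothetical names introduced here, not part of the answer above.

```r
# Hypothetical helper: build the level names in either precedence
make_levels <- function(prefix, suffix, sep = "_", suffix_first = FALSE) {
  # m[i, j] is paste(prefix[i], suffix[j], sep = sep)
  m <- outer(prefix, suffix, paste, sep = sep)
  # Reading m column by column walks the prefixes fastest (suffix precedence);
  # transposing first walks the suffixes fastest (prefix precedence)
  if (suffix_first) as.vector(m) else as.vector(t(m))
}

x <- c("DBO_A", "PH_A", "COND_A", "DBO_B",
       "PH_B", "COND_B", "DBO_C", "PH_C", "COND_C")
a <- c("PH", "COND", "DBO")
b <- c("B", "A", "C")

x_lhs <- factor(x, levels = make_levels(a, b))
x_rhs <- factor(x, levels = make_levels(a, b, suffix_first = TRUE))
levels(x_lhs)
levels(x_rhs)
```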

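Answer 1: (score: 2)

We can do this with base R. Use sub to strip one substring from the levels of the vector, use match against the values in the custom order to create a numeric index, and then reassign the levels of the factor by ordering the levels vector with order on those match indices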

# position of each level's prefix / suffix in the custom orders
i1 <- match(sub("_.*", "", levels(x)), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", levels(x)), c("B", "A", "C"))
factor(x, levels = levels(x)[order(i1, i2)])

For the second case, just swap the indices passed to order

factor(x, levels = levels(x)[order(i2, i1)])
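It may help to look at the intermediate values. The sketch below reproduces the question's data and prints the two match indices and the resulting permutations; the names lv_lhs and lv_rhs are introduced here only for illustration.

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))

levels(x)  # alphabetical: COND_A COND_B COND_C DBO_A DBO_B DBO_C PH_A PH_B PH_C

# Rank of each level's prefix and suffix in the custom orders
i1 <- match(sub("_.*", "", levels(x)), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", levels(x)), c("B", "A", "C"))
i1  # 2 2 2 3 3 3 1 1 1
i2  # 2 1 3 2 1 3 2 1 3

# order() sorts by its first argument and breaks ties with the second
lv_lhs <- levels(x)[order(i1, i2)]  # prefix decides first
lv_rhs <- levels(x)[order(i2, i1)]  # suffix decides first
lv_lhs
lv_rhs
```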

For reuse, this can be wrapped in a function

f1 <- function(vec, lvls1, lvls2, flag = "former") {
   # position of each level's prefix and suffix in the custom orders
   i1 <- match(sub("_.*", "", levels(vec)), lvls1)
   i2 <- match(sub(".*_", "", levels(vec)), lvls2)

   if(flag == 'former') {
     # prefix order takes precedence
     factor(vec, levels = levels(vec)[order(i1, i2)])
   } else {
     # suffix order takes precedence
     factor(vec, levels = levels(vec)[order(i2, i1)])
   }
}

f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"))
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C


f1(x, c("PH", "COND", "DBO"), c("B", "A", "C"), flag = "latter")
#[1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
#Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C
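One thing to be aware of with the match/order approach: match returns NA for a substring that is missing from the custom order, and order places NA last by default (na.last = TRUE), so such levels silently end up at the end instead of raising an error. A small sketch, where "TEMP" is a made-up prefix that is deliberately absent from the custom order:

```r
# "TEMP" is a made-up prefix that does not appear in the custom order
x <- factor(c("DBO_A", "PH_A", "TEMP_A"))

i1 <- match(sub("_.*", "", levels(x)), c("PH", "COND", "DBO"))
i2 <- match(sub(".*_", "", levels(x)), c("B", "A", "C"))

i1                      # contains NA for the "TEMP_A" level
x2 <- factor(x, levels = levels(x)[order(i1, i2)])
levels(x2)              # "PH_A" "DBO_A" "TEMP_A"

# Optional guard if unlisted substrings should be treated as an error:
# stopifnot(!anyNA(i1), !anyNA(i2))
```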

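Answer 2: (score: 2)

Using the data.table convenience functions tstrsplit and setorderv:

Create a vector of (arbitrary) column names for the substrings (cols <- c("V1", "V2")). Convert the vector to a data.table (d <- data.table(x)). Split the vector into two columns ((cols) := tstrsplit(x, split = "_")). Set the factor levels of the substrings (factor(V1, levels = l1)). Order the data by the first substring and then the second, or by the second and then the first (setorderv(d, if(prec == 1) cols else rev(cols))). Use the now ordered column 'x' of the data.table as the factor levels of the vector 'x' (factor(x, levels = unique(as.character(d$x)))).

library(data.table)

f <- function(x, l1, l2, prec){
  cols <- c("V1", "V2")
  d <- data.table(x)
  d[ , (cols) := tstrsplit(x, split = "_")]
  d[ , `:=`(
    V1 = factor(V1, levels = l1),
    V2 = factor(V2, levels = l2))]
  setorderv(d, if(prec == 1) cols else rev(cols))
  # unique() guards against repeated values in x, since factor() rejects duplicated levels
  factor(x, levels = unique(as.character(d$x)))
}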

# First substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 1)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

# Second substring has precedence
f(x, l1 = c("PH", "COND", "DBO"), l2 = c("B", "A", "C"), prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C

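A similar base alternative, but with the substrings placed in a matrix. Use standard regular expressions to grab the substrings. Convert them to factors and set the levels. Create a column index (i <- c(1, 2, 1)[prec:(prec + 1)]). Order the values of 'x' to obtain its levels (unique(as.character(x)[order(m[ , i[1]], m[ , i[2]])])).

f2 <- function(x, l1, l2, prec){
  # cbind() keeps the integer codes of the two factors
  m <- cbind(factor(sub("_.*", "", x), l1), factor(sub(".*_", "", x), l2))
  i <- c(1, 2, 1)[prec:(prec + 1)]
  # unique() guards against repeated values in x
  factor(x, levels = unique(as.character(x)[order(m[ , i[1]], m[ , i[2]])]))
}

l1 <- c("PH", "COND", "DBO")
l2 <- c("B", "A", "C")

f2(x, l1, l2, prec = 1)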
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B PH_A PH_C COND_B COND_A COND_C DBO_B DBO_A DBO_C

f2(x, l1, l2, prec = 2)
# [1] DBO_A  PH_A   COND_A DBO_B  PH_B   COND_B DBO_C  PH_C   COND_C
# Levels: PH_B COND_B DBO_B PH_A COND_A DBO_A PH_C COND_C DBO_C
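The matrix step in f2 relies on cbind treating factors as their integer codes, so each row holds the rank of that value's prefix and suffix in the custom orders. A short sketch of that intermediate matrix for the question's data (the variable m mirrors the one inside f2; nothing else is assumed):

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))
l1 <- c("PH", "COND", "DBO")
l2 <- c("B", "A", "C")

# Column 1: rank of the prefix in l1; column 2: rank of the suffix in l2
m <- cbind(factor(sub("_.*", "", x), l1), factor(sub(".*_", "", x), l2))

typeof(m)   # "integer": the factor codes, not the labels
m[1:3, ]    # rows for DBO_A, PH_A, COND_A: (3, 2), (1, 2), (2, 2)

# Sorting the rows by column 1, then column 2, gives LHS precedence
as.character(x)[order(m[, 1], m[, 2])]
```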

Answer 3: (score: -1)

Something like

x <- with(expand.grid(x = c("DBO", "PH", "COND"), y = c("A", "B", "C")),
          factor(paste(x, y, sep = "_"), levels = paste(x, y, sep = "_")))

You do not need to write out every possible level, only the levels for one side and for the other.
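Note that expand.grid varies its first argument fastest, so the order in which the two sides are passed decides which one takes precedence. The following is a sketch of how the same idea could be applied to the question's existing x with the custom orders from the question; the names lhs, rhs, g1, g2, x_lhs and x_rhs are introduced here for illustration.

```r
x <- factor(c("DBO_A", "PH_A", "COND_A", "DBO_B", "PH_B", "COND_B",
              "DBO_C", "PH_C", "COND_C"))
lhs <- c("PH", "COND", "DBO")
rhs <- c("B", "A", "C")

# LHS precedence: the suffix must vary fastest, so it is passed first
g1 <- expand.grid(s = rhs, p = lhs, stringsAsFactors = FALSE)
x_lhs <- factor(x, levels = paste(g1$p, g1$s, sep = "_"))

# RHS precedence: the prefix must vary fastest, so it is passed first
g2 <- expand.grid(p = lhs, s = rhs, stringsAsFactors = FALSE)
x_rhs <- factor(x, levels = paste(g2$p, g2$s, sep = "_"))

levels(x_lhs)
levels(x_rhs)
```

Unlike the match/order approach, this builds every prefix/suffix combination, so combinations that never occur in x still appear as (empty) levels.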