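Remove many entries from a string vector by name

Time: 2018-11-18 14:10:00

Tags: r string vector

I want to remove about 100 entries from a vector of 500 column names, and then use that vector to set rows of the (prediction) matrix m to zero.

As a very simplified example of my data frame: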

A 1 2 3
B 1 2 3
C 1 2 3
D 1 2 3
E 1 2 3
F 1 2 3
G 1 2 3
H 1 2 3
I 1 2 3
J 1 2 3

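First, I put the column names into a vector:

x <- colnames(df) # x <- c("A","B","C","D","E","F","G","H","I","J")

Suppose I want to remove B through D, F, and G through I (in reality it is about 100 variables scattered across the vector, and I do not know their indices). I would like to do something like:

remove <- c(B:D, F, G:I) # This does not work, obviously
x[!x %in% remove]

which would give me a vector x like this: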

A
E
J

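This vector represents the row names (and column names, since it is a prediction matrix) that need to be set to zero:

m[x,] <- 0

producing the following output: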

  A B C D E F G H
A 1 0 1 0 1 0 1 0
B 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0
E 1 0 1 0 1 0 1 0
F 1 0 1 0 1 0 1 0
G 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0
J 1 0 1 0 1 0 1 0

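How can I remove these 100 variable names from the vector of all variable names, and then use that vector to refer to the column names of the matrix?

3 answers:

Answer 0: (score: 1)

Interesting use case. We can design a function that helps you do this in the generic way you are after.


Note:

I used a data frame below b/c I don't think a matrix was mentioned originally (or I just missed it), and the various question edits have since muddled column names vs. row names. So what you should focus on from below is: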

# get the terms of the formula
trms <- terms(remove_spec)

# get each element (will be each group separated by `+`)
elements <- attr(trms, "term.labels")

# adding in assertions to validate `col` is in `xdf` and that only
# the restricted syntax is used in the formula and that it's valid 
# is up to the OP

# now, find the positions of all those strings
unlist(lapply(elements, function(y) {
  if (grepl(":", y)) {
    rng <- strsplit(y, ":")[[1]]
    which(x[,col] == rng[1]) : which(x[,col] == rng[2])
  } else {
    which(x[,col] == y)
  }
}), use.names = FALSE) -> to_exclude

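I'm now done with this q (row names are so 1980s :-). Note the caveats at the end of the answer.

Others should feel free to use this in an actual matrix answer for the OP's use case.


We'll make some simulated data (this way you can make the example bigger if you need a larger one):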

library(dplyr) # mostly for saner data frame constructor & printing

set.seed(2018-11-18)

tibble(
  cat = LETTERS,
  val1 = sample(100, length(cat), replace = TRUE),
  val2 = sample(100, length(cat), replace = TRUE),
  val3 = sample(100, length(cat), replace = TRUE)
) -> xdf

xdf
## # A tibble: 26 x 4
##    cat    val1  val2  val3
##    <chr> <int> <int> <int>
##  1 A        87    98     5
##  2 B        30    69    39
##  3 C        87     1    32
##  4 D        65    46    87
##  5 E         4    69     6
##  6 F        53    20    31
##  7 G        43    51    84
##  8 H        27    43    65
##  9 I        27     9    10
## 10 J        10    94    11
## # ... with 16 more rows

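(tibble printing is def >> base printing IMO, but I digress.)

Now, you want to use strings to specify individual elements and ranges, along with something that says how to do the exclusion. We need a function for that, and we can take advantage of the special R class formula to get a more compact syntax. I.e., wouldn't it be nice to be able to call a function like this: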

remove_rows(xdf, cat, ~B:C+F+G:I)

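and have it look for the range "B":"C" in the `cat` column of `xdf`, find the position of "F", then the range "G":"I", and return the data frame with those rows excluded? Why, yes. Yes it would. So, let's build it!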
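To see why a formula is a convenient carrier for this kind of spec, here is a minimal sketch of what `terms()` extracts from it. The variable names `spec`, `elements` and `rng` are only illustrative. Note that `terms()` sorts labels by interaction order, so single names come before ranges regardless of how the formula was typed.

```r
# A one-sided formula is parsed but never evaluated, so B, C, F, ... do not
# need to exist as objects.
spec <- ~B:C + F + G:I

# Each `+`-separated group becomes one term label.
elements <- attr(terms(spec), "term.labels")
print(elements)

# A range label can then be split into its two end points.
rng <- strsplit("B:C", ":")[[1]]
print(rng)
```

Because the order of `elements` is not the typed order, the positions collected from it should be treated as a set rather than a sequence.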

Here is the full function, so now we can call it for real.

#' @param x data frame
#' @param col bare column name to use for the comparison
#' @param remove_spec formula; restricted operators are `:` for range and `+` for adding selectors
remove_rows <- function(x, col, remove_spec) {

  # this is pure convenience we could just as easily have forced folks 
  # to pass in a string (and we can modify it to handle both)
  col <- as.character(substitute(col)) 

  # get the terms of the formula
  trms <- terms(remove_spec)

  # get each element (will be each group separated by `+`)
  elements <- attr(trms, "term.labels")

  # adding in assertions to validate `col` is in `xdf` and that only
  # the restricted syntax is used in the formula and that it's valid 
  # is up to the OP

  # now, find the positions of all those strings
  unlist(lapply(elements, function(y) {
    if (grepl(":", y)) {
      rng <- strsplit(y, ":")[[1]]
      which(x[,col] == rng[1]) : which(x[,col] == rng[2])
    } else {
      which(x[,col] == y)
    }
  }), use.names = FALSE) -> to_exclude

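  # and get rid of those puppies; with nothing matched, -to_exclude would
  # be an empty index and drop every row, so return x unchanged instead
  if (length(to_exclude) == 0) return(x)
  x[-to_exclude,]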

}

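The function is poorly named, so you may want to change that, and you really should add some parameter checking and validation, but I believe this does what you asked for (assuming you are really sure the data frame is in the order you think it is).

It is also imperfect because the strings are constrained to what is valid in a formula (one such constraint being that they cannot start with a digit unless they are backticked). However, you did not provide a sample of the real strings.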
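One possible way around the formula restriction, sketched here rather than taken from the answer, is to pass the spec as plain strings and look positions up with `match()`. The helper name `expand_spec` and its arguments are hypothetical, and the sketch assumes that no variable name itself contains a `:`.

```r
# Hypothetical helper: turn specs such as "B:D" or "F" into integer
# positions within `all_names`.
expand_spec <- function(all_names, spec) {
  idx <- lapply(spec, function(s) {
    ends <- strsplit(s, ":", fixed = TRUE)[[1]]
    pos <- match(ends, all_names)
    # Fail loudly on unknown names or malformed ranges such as "A:B:C".
    if (anyNA(pos) || length(pos) > 2) stop("bad spec element: ", s)
    if (length(pos) == 2) seq(pos[1], pos[2]) else pos
  })
  unique(unlist(idx))
}

x <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
to_drop <- expand_spec(x, c("B:D", "F", "G:I"))
kept <- x[-to_drop]
print(kept)

# Names that start with a digit need no backticks in this form.
y <- c("1st", "2nd", "3rd", "4th")
kept_y <- y[-expand_spec(y, "2nd:3rd")]
print(kept_y)
```

Like the formula version, this relies on the names being in the order you expect, since a range is resolved by position.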

Answer 1: (score: 1)

I got it to work using hrbrmstr's answer plus a lengthy workaround. If anyone can tell me how to make this less messy, please let me know.

# Copy prediction matrix and turn it into a dataframe for the "remove rows" function
varlist <- m
varlist <- as.data.frame(varlist)

# Create a column called "cat" with the rownames for the "remove rows" function
varlist$cat = rownames(varlist)
# Use the function to remove the rows from the copied df
varlist <- remove_rows(varlist, cat, ~B:C+F+G:I)
# Only keep the "cat" column, which is already a plain character vector
varlist <- varlist$cat
# Copy prediction matrix and use "varlist" to put the correct rows to zero.
m_reduced <- m
m_reduced[ ,varlist] <- 0

If someone can tell me how to clean up this monstrosity, I would be very happy.
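As one possible cleanup, the round trip through a data frame can be skipped by looking the names up directly in the matrix's dimnames. This is only a sketch: `m` below is a stand-in 10 x 10 matrix of ones, `pos` is a hypothetical helper, and zeroing columns mirrors the `m_reduced[ ,varlist] <- 0` line above.

```r
# Stand-in for the prediction matrix (assumption: same row and column names).
nm <- LETTERS[1:10]
m <- matrix(1, nrow = 10, ncol = 10, dimnames = list(nm, nm))

# Hypothetical helper: position of a name among the column names.
pos <- function(name) match(name, colnames(m))

# Positions of B:C, F and G:I, i.e. the names to drop from the list.
drop_idx <- c(pos("B"):pos("C"), pos("F"), pos("G"):pos("I"))

# Names that remain after dropping those positions.
varlist <- colnames(m)[-drop_idx]

# Zero the remaining columns in a copy of the matrix.
m_reduced <- m
m_reduced[, varlist] <- 0
print(varlist)
```

If rows are what need zeroing instead, the last assignment would become `m_reduced[varlist, ] <- 0`.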

Answer 2: (score: 0)

Here is my way:

remove<-function(lets_to_be_removed,names){
    letters_with_names<-1:length(LETTERS) # each value corresponds to a letter
    names(letters_with_names)<-LETTERS # the letters, for example: letters_with_name["A"]==1 is TRUE
    result<-integer()
    for(letters in lets_to_be_removed){
        #check if it is only one letter
        res <- if(length(letters) == 1) letters_with_names[letters] else letters_with_names[letters[1]]:letters_with_names[letters[2]] 
        result<- c(result,res)
    }
    names(result)<-LETTERS[result]
    result #return the indices of the letters
}

You can call it like this:

letters <- list(c("B","D"),"F",c("G","I"))
letters
[[1]]
[1] "B" "D" # B:D sequence
[[2]]
[1] "F" # only one letter
[[3]]
[1] "G" "I" # G:I sequence

indices<-remove(letters,x)
indices # named vector
B C D F G H I 
2 3 4 6 7 8 9

x[ -indices ] # it is faster than [! x %in% indices] but if you want your method  then use [! x %in% names(indices)]
[1] "A" "E" "J"

In general, indexing with integers is better and faster than indexing with characters.
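A small sketch comparing the two forms may help. Both give the same result here, but negative integer indexing has one pitfall worth knowing: an empty index drops everything instead of nothing. The `indices` vector is typed out by hand to match the output shown above.

```r
x <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
indices <- c(B = 2L, C = 3L, D = 4L, F = 6L, G = 7L, H = 8L, I = 9L)

# Drop by position and drop by name give the same vector.
by_position <- x[-indices]
by_name <- x[!x %in% names(indices)]
print(identical(by_position, by_name))

# Pitfall: with nothing to drop, the negative form returns an empty vector,
# while the %in% form returns x unchanged.
nothing <- integer(0)
empty_result <- x[-nothing]
full_result <- x[!x %in% character(0)]
print(length(empty_result))
print(length(full_result))
```

Positional dropping also assumes the integers really refer to positions in `x`, which holds for the lookup above only when `x` follows the order of `LETTERS`.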