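I want to remove roughly 100 entries from a vector of 500 column names, and then use that vector to set rows of a (prediction) matrix m to zero.
As a very simple example of my data frame: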
A 1 2 3
B 1 2 3
C 1 2 3
D 1 2 3
E 1 2 3
F 1 2 3
G 1 2 3
H 1 2 3
I 1 2 3
J 1 2 3
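First, I put the column names into a vector:
x <- colnames(df) # x <- c("A","B","C","D","E","F","G","H","I","J")
Suppose I want to remove B through D, F, and G through I (in reality it is about 100 variables scattered across the vector, and I don't know their indices). I would like to do something like:
remove <- c(B:D, F, G:I) # This does not work, obviously
x[!x %in% remove]
which would give me a vector x like this: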
A
E
J
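This vector holds the row names (and column names, since it is a prediction matrix) that need to be set to zero:
m[x,] <- 0
producing the following output: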
A B C D E F G H
A 1 0 1 0 1 0 1 0
B 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0
E 1 0 1 0 1 0 1 0
F 1 0 1 0 1 0 1 0
G 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0
J 1 0 1 0 1 0 1 0
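How can I remove these 100 variable names from the vector of all variable names, and then use that vector to refer to the column names of the matrix?
Answer 0 (score: 1)
Interesting use case. We can design a function that helps you do this in the generic way you're after.
I used a data frame below b/c I don't think a matrix was mentioned originally (or I just missed it), and the various question edits have since muddled the column and row names. So, what you should focus on from below is: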
# get the terms of the formula
trms <- terms(remove_spec)
# get each element (will be each group separated by `+`)
elements <- attr(trms, "term.labels")
# adding in assertions to validate `col` is in `xdf` and that only
# the restricted syntax is used in the formula and that it's valid
# is up to the OP
# now, find the positions of all those strings
unlist(lapply(elements, function(y) {
  if (grepl(":", y)) {
    rng <- strsplit(y, ":")[[1]]
    which(x[[col]] == rng[1]) : which(x[[col]] == rng[2])
  } else {
    which(x[[col]] == y)
  }
}), use.names = FALSE) -> to_exclude
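To see what the two parsing steps hand back, here is a small sketch using the formula from the example call (spec and labels are names made up for illustration). As far as I know, terms() may list the labels in a different order than they were written. That is harmless here because only the set of positions matters.

```r
# One-sided formula in the restricted syntax: `:` for a range, `+` to combine
spec <- ~ B:C + F + G:I

# One label per `+`-separated group (order not guaranteed to match the input)
labels <- attr(terms(spec), "term.labels")
print(labels)

# A range label splits into its two endpoints
ends <- strsplit("B:C", ":")[[1]]
print(ends)  # "B" "C"
```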
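I'm done with this q now (row names are so 1980s :-). Note the caveats at the end of the answer.
Others should feel free to build on this in a real matrix answer for the OP's use case.
We'll make some mock data (this way you can make the example bigger if you need a larger one):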
library(dplyr) # mostly for saner data frame constructor & printing
set.seed(2018-11-18)
tibble(
cat = LETTERS,
val1 = sample(100, length(cat), replace = TRUE),
val2 = sample(100, length(cat), replace = TRUE),
val3 = sample(100, length(cat), replace = TRUE)
) -> xdf
xdf
## # A tibble: 26 x 4
## cat val1 val2 val3
## <chr> <int> <int> <int>
## 1 A 87 98 5
## 2 B 30 69 39
## 3 C 87 1 32
## 4 D 65 46 87
## 5 E 4 69 6
## 6 F 53 20 31
## 7 G 43 51 84
## 8 H 27 43 65
## 9 I 27 9 10
## 10 J 10 94 11
## # ... with 16 more rows
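({tibble} printing is def >> base printing IMO, but I digress.)
Now, you want to use strings to specify individual elements and ranges, plus some way of saying what to exclude. We need a function for that, and we can take advantage of R's special formula class to get a more compact syntax. I.e., wouldn't it be great to be able to call a function like this: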
remove_rows(xdf, cat, ~B:C+F+G:I)
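and have it look in the cat column of xdf for the range "B":"C", find the position of "F", then the range "G":"I", and return the data frame with those rows excluded? Yes, yes it would. So, let's build it! With the function below defined, we can call it for real.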
#' @param x data frame
#' @param col bare column name to use for the comparison
#' @param remove_spec one-sided formula; restricted operators are `:` for a range and `+` for adding selectors
remove_rows <- function(x, col, remove_spec) {
  # this is pure convenience; we could just as easily have forced folks
  # to pass in a string (and we can modify it to handle both)
  col <- as.character(substitute(col))
  # get the terms of the formula
  trms <- terms(remove_spec)
  # get each element (will be each group separated by `+`)
  elements <- attr(trms, "term.labels")
  # adding in assertions to validate `col` is in `xdf` and that only
  # the restricted syntax is used in the formula and that it's valid
  # is up to the OP
  # now, find the positions of all those strings
  unlist(lapply(elements, function(y) {
    if (grepl(":", y)) {
      rng <- strsplit(y, ":")[[1]]
      which(x[[col]] == rng[1]) : which(x[[col]] == rng[2])
    } else {
      which(x[[col]] == y)
    }
  }), use.names = FALSE) -> to_exclude
  # a zero-length negative index would drop every row, so bail out early
  if (length(to_exclude) == 0) return(x)
  # and get rid of those puppies
  x[-to_exclude,]
}
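The function is poorly named, so you may want to change that, and you really should add some argument checking and validation, but I believe this does what you asked for (assuming you are really sure the data frame is in the order you think it is).
Also, this is imperfect because the strings are constrained by being in a formula (one such constraint is that they can't start with a number unless they are backtick-quoted). However, you didn't provide a sample of the real strings.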
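If the formula constraints get in the way, one alternative is to pass the selectors as plain strings and do the validation up front. This is only a sketch and not part of the function above: expand_ranges is a hypothetical helper, and it assumes the names in x are unique and never contain a ":" themselves.

```r
# Hypothetical helper: expand specs such as "B:D" or "F" into the names they
# cover, using the positions of the endpoints in `x`.
expand_ranges <- function(x, spec) {
  pos <- lapply(strsplit(spec, ":", fixed = TRUE), function(ends) {
    idx <- match(ends, x)  # position(s) of the endpoint(s) in x
    if (anyNA(idx)) {
      stop("not found in x: ", paste(ends[is.na(idx)], collapse = ", "))
    }
    seq(idx[1], idx[length(idx)])  # a single name yields a single position
  })
  x[sort(unique(unlist(pos)))]
}

x <- LETTERS[1:10]
to_remove <- expand_ranges(x, c("B:D", "F", "G:I"))
keep <- setdiff(x, to_remove)
print(keep)  # "A" "E" "J"
```

Because the selectors are ordinary strings, names that start with a digit need no backticks, and a misspelled name fails loudly instead of silently selecting nothing.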
Answer 1 (score: 1)
I got it working using hrbrmstr's answer plus a very long workaround. If anyone can show me how to make this less messy, please let me know.
# Copy prediction matrix and turn it into a dataframe for the "remove rows" function
varlist <- m
varlist <- as.data.frame(varlist)
# Create a column called "cat" with the rownames for the "remove rows" function
varlist$cat = rownames(varlist)
# Use the function to remove the rows from the copied df
varlist <- remove_rows(varlist, cat, ~B:C+F+G:I)
# Only keep the "cat" column, which is already a character vector
varlist <- varlist$cat
# Copy prediction matrix and use "varlist" to set the matching columns to zero.
m_reduced <- m
m_reduced[ ,varlist] <- 0
If anyone can tell me how to clean up this monstrosity, I would be very happy.
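One possible cleanup, sketched on a toy stand-in for m, is to skip the data frame round trip and look the positions up directly in the column names. The 10 x 10 matrix of ones and the helper pos are assumptions for illustration, not the real prediction matrix.

```r
# Toy stand-in for the prediction matrix `m` (assumption: dimnames are unique)
vars <- LETTERS[1:10]
m <- matrix(1, nrow = 10, ncol = 10, dimnames = list(vars, vars))

nm <- colnames(m)
pos <- function(name) match(name, nm)  # position of one name in the column names

# Ranges are expressed on positions, so no indices have to be known in advance
drop_idx <- c(pos("B"):pos("D"), pos("F"), pos("G"):pos("I"))
varlist <- nm[-drop_idx]  # the names that remain after the removal

m_reduced <- m
m_reduced[, varlist] <- 0
```

With the real matrix, only the definition of m changes; the rest stays the same as long as every name used in pos() actually exists in colnames(m).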
Answer 2 (score: 0)
Here is my way:
remove<-function(lets_to_be_removed,names){
letters_with_names<-1:length(LETTERS) # each value corresponds to a letter
names(letters_with_names)<-LETTERS # the letters, for example: letters_with_names["A"]==1 is TRUE
result<-integer()
for(letters in lets_to_be_removed){
#check if it is only one letter
res <- if(length(letters) == 1) letters_with_names[letters] else letters_with_names[letters[1]]:letters_with_names[letters[2]]
result<- c(result,res)
}
names(result)<-LETTERS[result]
result #return the indices of the letters
}
You can call it like this:
letters <- list(c("B","D"),"F",c("G","I"))
letters
[[1]]
[1] "B" "D" # B:D sequence
[[2]]
[1] "F" # only one letter
[[3]]
[1] "G" "I" # G:I sequence
indices<-remove(letters,x)
indices # named vector
B C D F G H I
2 3 4 6 7 8 9
x[ -indices ] # it is faster than [! x %in% indices] but if you want your method then use [! x %in% names(indices)]
[1] "A" "E" "J"
In general, indexing with integers is better and faster than indexing with character strings.
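A quick way to check this on your own data is to confirm that both styles return the same result and then time them. The sketch below uses 500 made-up names (x, drop_names and drop_idx are illustrative), and the timings will vary with machine and R version, so treat them as a rough indication only.

```r
# Hypothetical data: 500 unique names, 100 of which are to be dropped
x <- sprintf("V%03d", 1:500)
drop_names <- x[seq(5, 500, by = 5)]  # the names to remove
drop_idx <- match(drop_names, x)      # the same entries as integer positions

by_index <- x[-drop_idx]
by_name <- x[!x %in% drop_names]
print(identical(by_index, by_name))   # both approaches keep the same names

# Rough timing only; numbers depend on machine and R version
print(system.time(for (i in 1:1000) x[-drop_idx]))
print(system.time(for (i in 1:1000) x[!x %in% drop_names]))
```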