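Hopefully the wording of the title makes sense. I have a data frame whose cells take the values "A", "B", "C", "D", "", or "A/B". I want to identify which rows contain at most 2 of the letters "A", "B", "C", and "D". How often each letter occurs in a row does not matter; I only want to know whether a row contains more than 2 different letters.
Here is a sample data frame: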
df.sample = as.data.frame(rbind(c("A","B","A","A/B","B","B","B","B","","B"),c("A","B","C","A","B","","","B","","B"),c("A","B","D","D","B","B","B","B","","B"),c("A","B","A","A","B","B","B","B","B","B")))
df.sample
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
2  A  B  C   A  B        B      B
3  A  B  D   D  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
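I would like to apply a function to each row that records, for each of the 4 letters ("A", "B", "C", or "D"), not how often it occurs but only whether it occurs, so that each letter gets a value of 0 or 1. If the sum of those 4 values is > 2, I want to assign that row's index to a new vector, which will then be used to remove those rows from the data frame.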
myfun <- function(x) {
  # Which rows contain > 2 different letters of A, B, C, or D?
  # The number of times each letter occurs in a given row does not matter.
  # What matters is whether a row contains more than 2 of the 4 letters. Each row should only contain 2 of them. The combination does not matter.
  out = which(something > 2)  # 'something' is a placeholder for the part I am missing
}
row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.
new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.
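In df.sample above, rows 2 and 3 contain more than 2 of the 4 letters, so they should be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this: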
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
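I tried to think of this as a logical test for each of the 4 letters: assign each letter a 0 or 1, sum them, and then determine which rows have a sum > 2. For example, I thought I could use 'grep()', convert the result to a logical for each letter, turn that into 0 or 1, and sum. That seems too long-winded, and it did not work the way I tried it. Any ideas?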
Answer 0: (score: 2)
Here is a function for this task. It returns a logical value; TRUE marks a row that contains more than two different letters:
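myfun <- function(x) {
  # Split combined entries such as "A/B" into their individual letters.
  sp <- unlist(strsplit(x, "/"))
  # Count the distinct letters among A-D; TRUE if there are more than two.
  length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
}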
row.indexes <- apply(df.sample, 1, myfun)
# [1] FALSE TRUE TRUE FALSE
new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'
#   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
# 1  A  B  A A/B  B  B  B  B      B
# 4  A  B  A   A  B  B  B  B  B   B
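
For comparison, the 0/1-per-letter idea from the question can also be written out directly. This is only a sketch: the names letters4, present, n.letters and drop.rows are made up for illustration, and it assumes every cell is one of the values listed in the question. It uses grepl() with fixed = TRUE, so a cell such as "A/B" counts towards both "A" and "B", and which() to get the integer row indexes the question asked for.

```r
# Sample data from the question, kept as character columns.
df.sample <- as.data.frame(rbind(
  c("A","B","A","A/B","B","B","B","B","","B"),
  c("A","B","C","A","B","","","B","","B"),
  c("A","B","D","D","B","B","B","B","","B"),
  c("A","B","A","A","B","B","B","B","B","B")
), stringsAsFactors = FALSE)

letters4 <- c("A", "B", "C", "D")  # hypothetical name for the letters of interest

# One TRUE/FALSE indicator per row and letter: does the letter occur anywhere
# in that row? The result is a matrix with one column per letter.
m <- as.matrix(df.sample)
present <- sapply(letters4, function(l) {
  apply(m, 1, function(row) any(grepl(l, row, fixed = TRUE)))
})

n.letters <- rowSums(present)      # number of distinct letters per row: 2 3 3 2
drop.rows <- which(n.letters > 2)  # integer row indexes: 2 and 3

# Guard against an empty index: df[-integer(0), ] would select zero rows.
new.df.sample <- if (length(drop.rows) > 0) df.sample[-drop.rows, ] else df.sample
```

The logical index in the answer above avoids the empty-index problem altogether, which is one reason to prefer `df.sample[!row.indexes, ]` over negative integer indexes.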