Remove rows whose values across columns contain more than 2 of 4 unique characters

Asked: 2014-01-22 15:19:03

Tags: r dataframe apply

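I hope the wording of the title makes sense. I have a data frame made up of the values "A", "B", "C", "D", "", and "A/B". I want to identify which rows contain only 2 of "A", "B", "C", or "D". The frequency of each letter within a row does not matter; I only want to know whether a row contains more than 2 distinct letters.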

Here is a sample data frame:

    df.sample = as.data.frame(rbind(c("A","B","A","A/B","B","B","B","B","","B"),c("A","B","C","A","B","","","B","","B"),c("A","B","D","D","B","B","B","B","","B"),c("A","B","A","A","B","B","B","B","B","B")))
    df.sample

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    2  A  B  C   A  B        B      B
    3  A  B  D   D  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

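I would like to apply a function to each row that determines which of the 4 letters ("A", "B", "C", or "D") are present: not the frequency of each letter, but essentially a value of 0 or 1 for each of "A", "B", "C", and "D". If the sum of these 4 values is > 2, I then want to assign that row's index to a new vector, which will be used to remove those rows from the data frame.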

    myfun = function(x){
      #which rows contain > 2 different letters of A, B, C, or D.
      #The number of times each letter occurs in a given row does not matter. 
      #What matters is if each row contains more than 2 of the 4 letters. Each row should only contain 2 of them. The combination does not matter.

      out = which(something > 2)
    }

    row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.

    new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.

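In df.sample above, rows 2 and 3 contain more than 2 of the 4 letters, so they should be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this: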

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

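I tried to think of this as a logical statement for each of the 4 letters: assign a 0 or 1 to each letter, sum them, and then identify which rows have a sum > 2. For example, I thought I could try 'grep()', convert the result to a logical for each letter, then convert that to 0 or 1 and sum. That seems too verbose, and it did not work the way I tried it. Any ideas?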

1 Answer:

Answer 0: (score: 2)

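Here is a function for this task. It splits combined values such as "A/B" on "/" before counting, and it returns a logical value: TRUE indicates a row that contains more than two distinct letters: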

    myfun <- function(x) {
      sp <- unlist(strsplit(x, "/"))
      length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
    }

    row.indexes <- apply(df.sample, 1, myfun)
    # [1] FALSE  TRUE  TRUE FALSE

    new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'

    #   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    # 1  A  B  A A/B  B  B  B  B      B
    # 4  A  B  A   A  B  B  B  B  B   B
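
The grep-based idea from the question (one presence flag per letter, summed per row) can also be made to work. The following is only a sketch of that alternative, not part of the original answer. The names `letters4`, `present`, `n.letters`, and `drop.rows` are made up for illustration, and `stringsAsFactors = FALSE` is added so the block behaves the same across R versions.

```r
# Sample data from the question, rebuilt so the block runs on its own.
# stringsAsFactors = FALSE is an assumption added for this sketch.
df.sample <- as.data.frame(rbind(
  c("A", "B", "A", "A/B", "B", "B", "B", "B", "", "B"),
  c("A", "B", "C", "A", "B", "", "", "B", "", "B"),
  c("A", "B", "D", "D", "B", "B", "B", "B", "", "B"),
  c("A", "B", "A", "A", "B", "B", "B", "B", "B", "B")
), stringsAsFactors = FALSE)

letters4 <- c("A", "B", "C", "D")  # hypothetical name for the 4 letters
m <- as.matrix(df.sample)

# One column per letter: TRUE if that letter occurs anywhere in the row.
# grepl() with fixed = TRUE also finds "A" and "B" inside "A/B".
present <- sapply(letters4, function(l) {
  apply(m, 1, function(r) any(grepl(l, r, fixed = TRUE)))
})

# Number of distinct letters per row, then a logical mask of rows to drop.
n.letters <- rowSums(present)
drop.rows <- n.letters > 2

new.df.sample <- df.sample[!drop.rows, ]

# Why a logical mask rather than -which(...): if no row qualifies, which()
# returns integer(0), and df[-integer(0), ] selects zero rows instead of all.
none <- which(n.letters > 4)          # integer(0): no row exceeds 4 letters
nrow(df.sample[-none, ])              # 0 rows
nrow(df.sample[!(n.letters > 4), ])   # 4 rows
```

On the sample data this keeps rows 1 and 4, matching the result above. The last three lines illustrate why negating a logical index with `!` is safer than the `-row.indexes` form used in the question: a negative index built from an empty `which()` result drops every row.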