Question

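Hopefully the wording of the title makes sense. I have a data frame made up of the values "A", "B", "C", "D", "", and "A/B". I want to identify which rows contain only 2 of the letters "A", "B", "C", or "D". How often each letter occurs in a row does not matter. I just want to know whether a row contains more than 2 different letters.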

Here is a sample data frame:

    df.sample = as.data.frame(rbind(c("A","B","A","A/B","B","B","B","B","","B"),c("A","B","C","A","B","","","B","","B"),c("A","B","D","D","B","B","B","B","","B"),c("A","B","A","A","B","B","B","B","B","B")))
    df.sample

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    2  A  B  C   A  B        B      B
    3  A  B  D   D  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

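I want to apply a function to each row that records, for each of the 4 letters ("A", "B", "C", or "D"), whether it is present: not how often each letter occurs, just a value of 0 or 1 per letter. If the sum of these 4 values is > 2, I want to add that row's index to a new vector, which will then be used to remove those rows from the data frame.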

    myfun = function(x){
      #which rows contain > 2 different letters of A, B, C, or D.
      #The number of times each letter occurs in a given row does not matter. 
      #What matters is if each row contains more than 2 of the 4 letters. Each row should only contain 2 of them. The combination does not matter.

      out = which(something > 2) #'something' is a placeholder for the part I cannot work out
    }

    row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.

    new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.

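In df.sample above, rows 2 and 3 contain more than 2 of the 4 letters, so they should be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this: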

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

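I tried to think of this as a logical statement for each of the 4 letters: assign each letter a 0 or 1, sum them, and then identify which rows sum to > 2. For example, I thought I might try `grep()`, convert the result to a logical for each letter, then convert that to 0 or 1 and sum. That seemed too long-winded, and it did not work the way I tried it. Any ideas?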

Answer 1

Here is a function for this task. It returns a logical value for each row; TRUE marks a row that contains more than two distinct letters:

    myfun <- function(x) {
      # split combined values such as "A/B" into their individual letters
      sp <- unlist(strsplit(x, "/"))
      # TRUE if the row contains more than 2 distinct letters out of A, B, C, D
      length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
    }

    row.indexes <- apply(df.sample, 1, myfun)
    # [1] FALSE  TRUE  TRUE FALSE

    new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'

    #   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    # 1  A  B  A A/B  B  B  B  B      B
    # 4  A  B  A   A  B  B  B  B  B   B
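
For comparison, the 0/1-per-letter idea described in the question can also be made to work. The following is only a sketch, not part of the answer above: the names `letters4`, `present`, `n.distinct`, and `keep` are made up for illustration, and it assumes every cell is either empty, a single letter, or letters joined by "/", so that plain substring matching with `grepl()` is safe.

```r
# Sample data from the question
df.sample <- as.data.frame(rbind(
  c("A","B","A","A/B","B","B","B","B","","B"),
  c("A","B","C","A","B","","","B","","B"),
  c("A","B","D","D","B","B","B","B","","B"),
  c("A","B","A","A","B","B","B","B","B","B")
), stringsAsFactors = FALSE)

# Hypothetical helper names, chosen for this illustration only
letters4 <- c("A", "B", "C", "D")

# One column per letter: TRUE if the letter occurs anywhere in the row
# (substring match, so "A/B" counts for both A and B)
present <- sapply(letters4, function(l) {
  apply(df.sample, 1, function(row) any(grepl(l, row, fixed = TRUE)))
})

# Number of distinct letters per row (TRUE counts as 1, FALSE as 0)
n.distinct <- rowSums(present)

# Keep only the rows with at most 2 distinct letters
keep <- n.distinct <= 2
new.df.sample <- df.sample[keep, ]
```

With the sample data this keeps rows 1 and 4, the same rows as `myfun`. The `strsplit()` version is probably the more robust choice if the cells could ever hold longer strings that merely contain one of the letters.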

Remove rows whose values across columns contain more than 2 of 4 unique characters
