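For each row, check whether the value in one column is present in either of two other columns

Time: 2016-12-04 18:22:15

Tags: r dplyr

Suppose we have the following data frame: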

df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))

  X1 X2 X3
1  1  6  6
2  2  7  2
3  3  8  3
4  4  9  0
5  5 10  2

I want to add a new column (X4) of logical values. For each row, X4 should be TRUE if X3 equals X1 or X2, and FALSE otherwise.

I tried:

mutate(df, X4 = X3 %in% c(X2, X1))

  X1 X2 X3    X4
1  1  6  6  TRUE # OK
2  2  7  2  TRUE # OK
3  3  8  3  TRUE # OK
4  4  9  0 FALSE # OK
5  5 10  2  TRUE # expected to be FALSE

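Most importantly, my real df is very large, so I want to avoid for loops. I will favor the shortest (least code) and fastest solution.

3 answers:

Answer 0: (score: 2)

You can do this in a vectorized way, which is the fastest: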

df$X4 <- with(df, X3==X1 | X3==X2)
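
As a minimal sketch of why the attempt in the question misfires: inside mutate(), c(X2, X1) is the concatenation of both whole columns, so %in% asks whether X3 occurs anywhere in either column rather than in the same row. The base R lines below rebuild the example data and compare the two approaches; the names pooled and per_row are only illustrative and not part of the original answer.

```r
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))

# %in% against the pooled values of both columns:
# row 5 (X3 = 2) matches the 2 stored in row 2 of X1
pooled <- df$X3 %in% c(df$X2, df$X1)

# element-wise comparison: each X3 is compared only with
# the X1 and X2 values of its own row
per_row <- with(df, X3 == X1 | X3 == X2)

print(pooled)   # TRUE TRUE TRUE FALSE TRUE
print(per_row)  # TRUE TRUE TRUE FALSE FALSE
```

The element-wise version works because == and | both operate position by position on vectors of equal length, so no grouping by row is needed.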

Answer 1: (score: 1)

We can use Reduce

Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
#[1]  TRUE  TRUE  TRUE FALSE FALSE
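
The same pattern should extend to any number of candidate columns. Below is a rough sketch under that assumption; row_matches_any is a hypothetical helper name introduced here for illustration, not something from the original answer.

```r
# Hypothetical helper: TRUE where the `target` column equals at least
# one of the `candidates` columns in the same row.
row_matches_any <- function(data, target, candidates) {
  # one logical vector per candidate column
  hits <- lapply(data[candidates], `==`, data[[target]])
  # combine them with an element-wise OR
  Reduce(`|`, hits)
}

df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = c(6, 2, 3, 0, 2))
res <- row_matches_any(df, "X3", c("X1", "X2"))
print(res)  # TRUE TRUE TRUE FALSE FALSE
```

Passing column names instead of positions keeps the call readable when the data frame has many columns.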

Benchmark

This makes more sense on larger data

library(microbenchmark)
set.seed(24)
df <- data.frame(X1= sample(1:5, 1e6, replace=TRUE), X2 = sample(1:10, 1e6, replace=TRUE),
       X3 = sample(1:10, 1e6, replace=TRUE))

f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[,3]))
f3 <- function(df) with(df, X3==X1 | X3==X2)
microbenchmark(f2(df), f3(df))
#Unit: milliseconds
#   expr         min         lq       mean     median         uq      max neval
# f2(df)    8.191218   10.83333   23.28081   16.42744   22.26866  143.025   100
# f3(df)    8.154506   10.58878   19.17879   11.49179   22.41255  144.510   100

I expected apply to be slower, but Reduce is not slow.
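
Before reading much into the timings, it may be worth confirming that the two functions return the same result. The following is a small sketch using only base R; the data frame name small and the 1e4 row count are arbitrary choices made here, not part of the original benchmark.

```r
set.seed(24)
# smaller sample with the same column layout as the benchmark data
small <- data.frame(X1 = sample(1:5, 1e4, replace = TRUE),
                    X2 = sample(1:10, 1e4, replace = TRUE),
                    X3 = sample(1:10, 1e4, replace = TRUE))

f2 <- function(df) Reduce(`|`, lapply(df[1:2], `==`, df[, 3]))
f3 <- function(df) with(df, X3 == X1 | X3 == X2)

# both approaches should yield the same logical vector
same <- identical(f2(small), f3(small))
print(same)
```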

Answer 2: (score: 1)

A solution using dplyr.

library(dplyr)

df %>%
  rowwise() %>%
  mutate(X4 = any(c(X1, X2) %in% X3)) %>%
  ungroup()

# # A tibble: 5 x 4
#      X1    X2    X3 X4   
#   <int> <int> <dbl> <lgl>
# 1     1     6  6.00 T    
# 2     2     7  2.00 T    
# 3     3     8  3.00 T    
# 4     4     9  0    F    
# 5     5    10  2.00 F