Find the IDs whose columns contain the most values from a given list

Asked: 2018-02-12 11:39:26

Tags: r dataframe dplyr

I am trying to find the IDs that contain the most values from a user-supplied vector. A small sample dataset is shown below:

ID  Val1    Val2    Time
1   A         B     12:00
1   A         C     13:10
1   C         D     13:19
2   L         O     14:00
2   A         C     15:00
2   A         M     15:00
3   P         J     16:00

Search vector:

Vc = c("A","B","C","I","T")

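Values from the search vector may appear in both Val1 and Val2. Each distinct value should be counted once per ID. The result I am looking for is: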

ID  Match
1   3
2   2

3 answers:

Answer 0 (score: 1)

(Assumption: the values in Vc are unique.)
Using data.table:

library("data.table")
setDT(D)
D[, sum(Vc %in% c(Val1, Val2)), ID]
D[, sum(Vc %in% c(Val1, Val2)), ID][V1>0] # without zero counts

Alternative code (same logic):

D[, sum(unique(c(Val1, Val2)) %in% Vc), ID][V1>0] 
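The uniqueness assumption matters for the first form only. The following base R sketch, which uses the combined Val1/Val2 values of ID 1 and a hypothetical search vector with a repeated "A" (`Vc_dup`, not part of the question), shows how the two forms diverge once Vc contains duplicates. The helper names `count_a` and `count_b` are made up for illustration.

```r
# Values seen for ID 1 in the sample data, Val1 and Val2 combined
vals <- c("A", "A", "C", "B", "C", "D")

Vc     <- c("A", "B", "C", "I", "T")  # search vector from the question
Vc_dup <- c(Vc, "A")                  # hypothetical variant with a repeated "A"

# First form: counts elements of the search vector found in the data
count_a <- function(vc, x) sum(vc %in% x)
# Alternative form: counts distinct data values found in the search vector
count_b <- function(vc, x) sum(unique(x) %in% vc)

count_a(Vc, vals)      # 3
count_b(Vc, vals)      # 3
count_a(Vc_dup, vals)  # 4, the repeated "A" is counted twice
count_b(Vc_dup, vals)  # 3
```

If duplicates in Vc are possible, wrapping it in `unique()` before counting makes both forms agree.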

Data:

D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"ID  Val1    Val2    Time
1   A         B     12:00
1   A         C     13:10
1   C         D     13:19
2   L         O     14:00
2   A         C     15:00
2   A         M     15:00
3   P         J     16:00")
Vc = c("A", "B", "C", "I", "T")

Here is another data.table solution, this time using a join:

library("data.table")

D <- fread(
"ID  Val1    Val2    Time
1   A         B     12:00
1   A         C     13:10
1   C         D     13:19
2   L         O     14:00
2   A         C     15:00
2   A         M     15:00
3   P         J     16:00")
Vc <- data.table(V1=c("A", "B", "C", "I", "T"))

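# Stack Val1 and Val2 into one column V1; rep(ID, 2) keeps ID aligned without relying on recycling
long <- D[, .(V1 = c(Val1, Val2), ID = rep(ID, 2))]
long[Vc, on="V1", length(unique(V1)), ID]
long[Vc, on="V1", length(unique(V1)), ID, nomatch=0] # drops the NA group (values of Vc that never occur)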

Answer 1 (score: 0)


Answer 2 (score: 0)

You can also convert the data frame to long format and then count the matches:

library(tidyverse)

# df is the data frame from the question; Vc is the character search vector
df %>% 
  gather(k, v, Val1:Val2) %>% 
  distinct(ID, v) %>% 
  group_by(ID) %>% 
  summarize(Match = sum(v %in% Vc)) %>% 
  filter(Match > 0)

Result:

# A tibble: 2 x 2
     ID Match
  <int> <int>
1     1     3
2     2     2
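For readers who prefer to avoid extra packages, the same long-format idea can be sketched in base R. This is only an illustrative equivalent under the assumption that the data and search vector are exactly those of the question; the intermediate names `long`, `match_count` and `res` are made up here.

```r
D <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"ID  Val1    Val2    Time
1   A         B     12:00
1   A         C     13:10
1   C         D     13:19
2   L         O     14:00
2   A         C     15:00
2   A         M     15:00
3   P         J     16:00")
Vc <- c("A", "B", "C", "I", "T")

# Long format: one row per (ID, value), with duplicate pairs removed
long <- unique(data.frame(ID = rep(D$ID, 2), v = c(D$Val1, D$Val2)))

# Count, per ID, how many distinct values occur in the search vector
match_count <- tapply(long$v %in% Vc, long$ID, sum)
res <- data.frame(ID    = as.integer(names(match_count)),
                  Match = as.integer(match_count))

# Keep only IDs with at least one match
res <- res[res$Match > 0, ]
res
```

To get only the ID with the most matches, `res$ID[which.max(res$Match)]` returns 1 for this data.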