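Filter a table in R based on column values in another table?

Asked: 2015-11-07 21:38:01

Tags: r filter group-by filtering grouping

I searched here for similar questions but could not find an answer, so I would appreciate help with this task. I have one table containing a large dataset of medical records for more than 10,000 patients, and a second table with only 689 patients. I want to filter the large table so that it keeps only the records for patients who appear in the second table. I then want to create a new table that combines the two, so that I end up with three tables (the two filtered tables and one merged table).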

============================ What I have now =================

Table 1 (relevant patients):

ID  |  PatientID  | Record1 |  Record2 |  Record3
--------------------------------------------------------
1   |  7366       |  3      |  1      |     1
2   |  7362       |  3      |  1      |     1
3   |  7361       |  3      |  1      |     1
4   |  7360       |  3      |  1      |     1
5   |  7363       |  3      |  1      |     1

Table 2 (all patients):

   ID  |  PatientID  |  Blood      | SomeRecord |  Foo
    --------------------------------------------------------
    1   |  7316       |  06668      | 21/08/2015 |     1
    2   |  7302       |  08677      | 21/08/2015 |     3
    3   |  7341       |  07787      | 21/08/2015 |     2
    4   |  7340       |  08977      | 21/08/2015 |     1
    5   |  7313       |  07887      | 21/08/2015 |     1
    6   |  7366       |  56668      | 21/08/2015 |     1
    7   |  7362       |  88677      | 21/08/2015 |     3
    8   |  7361       |  77787      | 21/08/2015 |     2
    9   |  7360       |  98977      | 21/08/2015 |     1
    10  |  7363       |  87887      | 21/08/2015 |     1

I want to filter Table 2 by the patient IDs in Table 1, then combine Tables 1 and 2 into a new table.

============================ Desired Output =====================

Table 2 (all patients, now filtered):

   ID  |  PatientID  |  Blood      | SomeRecord |  Foo
    --------------------------------------------------------
    6   |  7366       |  56668      | 21/08/2015 |     1
    7   |  7362       |  88677      | 21/08/2015 |     3
    8   |  7361       |  77787      | 21/08/2015 |     2
    9   |  7360       |  98977      | 21/08/2015 |     1
    10  |  7363       |  87887      | 21/08/2015 |     1

Table 3 (all patients, now filtered, with all records combined):

   ID  |PatientID|Blood|SomeRecord|Foo|Record1|Record2|Record3
    --------------------------------------------------------
    6  |  7366   |56668|21/08/2015 |1 |   3   |    1   |  1    
    7  |  7362   |88677|21/08/2015 |3 |   3   |    1   |  1    
    8  |  7361   |77787|21/08/2015 |2 |   3   |    1   |  1    
    9  |  7360   |98977|21/08/2015 |1 |   3   |    1   |  1    
    10 |  7363   |87887|21/08/2015 |1 |   3   |    1   |  1    

4 answers:

Answer 0: (score: 1)

Just use the two joins from dplyr:

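library(dplyr)
# filtering join: rows of table2 with a match in table1, table2's columns only
semi_join(table2,table1, by=("PatientID"))
# merging join: matching rows, with table1's columns added
inner_join(table2,table1, by=("PatientID"))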

Result:

> semi_join(table2,table1, by=("PatientID"))
  ID PatientID Blood SomeRecord Foo
1  6      7366 56668 21/08/2015   1
2  7      7362 88677 21/08/2015   3
3  8      7361 77787 21/08/2015   2
4  9      7360 98977 21/08/2015   1
5 10      7363 87887 21/08/2015   1
> inner_join(table2,table1, by=("PatientID"))
  ID.x PatientID Blood SomeRecord Foo ID.y Record1 Record2 Record3
1    6      7366 56668 21/08/2015   1    1       3       1       1
2    7      7362 88677 21/08/2015   3    2       3       1       1
3    8      7361 77787 21/08/2015   2    3       3       1       1
4    9      7360 98977 21/08/2015   1    4       3       1       1
5   10      7363 87887 21/08/2015   1    5       3       1       1
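
The inner_join() result above carries both ID columns (ID.x and ID.y), whereas the desired Table 3 has a single ID. A minimal sketch of one way to get that shape, assuming table1's own ID column is not needed, is to drop it before joining. The sample data is re-created inline from the question so the block runs on its own; storing Blood as character is an assumption made here so that leading zeros survive.

```r
library(dplyr)

# Sample data re-created from the question
table1 <- data.frame(
  ID = 1:5,
  PatientID = c(7366, 7362, 7361, 7360, 7363),
  Record1 = 3, Record2 = 1, Record3 = 1
)
table2 <- data.frame(
  ID = 1:10,
  PatientID = c(7316, 7302, 7341, 7340, 7313, 7366, 7362, 7361, 7360, 7363),
  Blood = c("06668", "08677", "07787", "08977", "07887",
            "56668", "88677", "77787", "98977", "87887"),
  SomeRecord = "21/08/2015",
  Foo = c(1, 3, 2, 1, 1, 1, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Drop table1's ID so the join adds only Record1..Record3
table3 <- inner_join(table2, select(table1, -ID), by = "PatientID")
names(table3)
```

With table1's ID removed there is no name clash, so no .x/.y suffixes are added and table2's ID is kept as is.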

Data:

table1 <-read.table(text="ID    PatientID   Record1   Record2   Record3
1     7366         3        1           1
2     7362         3        1           1
3     7361         3        1           1
4     7360         3        1           1
5     7363         3        1           1",
header = TRUE, stringsAsFactors = FALSE)

table2 <-read.table(text="  ID    PatientID    Blood       SomeRecord   Foo
    1     7316         06668       21/08/2015      1
    2     7302         08677       21/08/2015      3
    3     7341         07787       21/08/2015      2
    4     7340         08977       21/08/2015      1
    5     7313         07887       21/08/2015      1
    6     7366         56668       21/08/2015      1
    7     7362         88677       21/08/2015      3
    8     7361         77787       21/08/2015      2
    9     7360         98977       21/08/2015      1
    10    7363         87887       21/08/2015      1",
header = TRUE, stringsAsFactors = FALSE)

Answer 1: (score: 0)

Try this:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery.isotope/2.2.2/isotope.pkgd.min.js"></script>
<div class="filters">
  <input type="checkbox" class="do_this_filter" value=".Hand-wash">Hand wash
  <br>
  <input type="checkbox" class="do_this_filter" value=".Machine-Wash">Machine Wash
  <br>
</div>
<ul class='products'>
  <li class="items Hand-wash">Demo product1</li>
  <li class="items Machine-Wash">Demo product2</li>
</ul>

Answer 2: (score: 0)

Here is how I would do this in data.table:

library(data.table)
setDT(table1) #convert each table _by reference_ to the data.table type
setDT(table2)

I actually think it's easier to do the second step first.

First, the merge (an inner join):

table3 <- table2[table1, on = "PatientID", nomatch = 0L]

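We can think of this as a subset, since table1 is in i; it is at the same time a merge (as the use of on shows), i.e. we are merging table1 and table2 on PatientID and keeping only the rows that match table1 (unmatched rows are dropped by setting nomatch = 0L).

Next, filter table2: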

table2 <- table3[ ,names(table2), with = FALSE]

Basically, we are just removing all of table1's columns from table3 to get the filtered table2.
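
If only the filtered table is needed, the filter can also be written directly without going through the merge. The following is a sketch of that alternative, not the approach described above; the sample data is re-created inline from the question, with Blood stored as character (an assumption) to keep leading zeros.

```r
library(data.table)

# Sample data re-created from the question
table1 <- data.table(
  ID = 1:5,
  PatientID = c(7366, 7362, 7361, 7360, 7363),
  Record1 = 3, Record2 = 1, Record3 = 1
)
table2 <- data.table(
  ID = 1:10,
  PatientID = c(7316, 7302, 7341, 7340, 7313, 7366, 7362, 7361, 7360, 7363),
  Blood = c("06668", "08677", "07787", "08977", "07887",
            "56668", "88677", "77787", "98977", "87887"),
  SomeRecord = "21/08/2015",
  Foo = c(1, 3, 2, 1, 1, 1, 3, 2, 1, 1)
)

# Keep the rows of table2 whose PatientID appears in table1
filtered <- table2[PatientID %in% table1$PatientID]
filtered
```

Unlike the join, this keeps table2's own row order and columns, and never repeats a row of table2 even if a PatientID occurs more than once in table1.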

Answer 3: (score: 0)

1) No packages. If DF1 and DF2 are the two data.frames, then M and M[1:5] are the required outputs. Omit the lines marked ## if the ordering is not needed:

M <- merge(DF2, DF1[-1], by = "PatientID")

o <- order(M$ID) ##
M <- M[o, ] ##

giving:

> M[1:5]

  PatientID ID Blood SomeRecord Foo
5      7366  6 56668 21/08/2015   1
3      7362  7 88677 21/08/2015   3
2      7361  8 77787 21/08/2015   2
1      7360  9 98977 21/08/2015   1
4      7363 10 87887 21/08/2015   1

> M
  PatientID ID Blood SomeRecord Foo Record1 Record2 Record3
5      7366  6 56668 21/08/2015   1       3       1       1
3      7362  7 88677 21/08/2015   3       3       1       1
2      7361  8 77787 21/08/2015   2       3       1       1
1      7360  9 98977 21/08/2015   1       3       1       1
4      7363 10 87887 21/08/2015   1       3       1       1
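
For the filtering step alone, base R can also subset with %in% instead of merge(). The sketch below re-creates the sample data inline (Blood as character is an assumption made here) and also shows why the two differ when a patient is listed twice in DF1; DF1_dup is a hypothetical table introduced only for illustration.

```r
# Sample data re-created from the question
DF1 <- data.frame(
  ID = 1:5,
  PatientID = c(7366, 7362, 7361, 7360, 7363),
  Record1 = 3, Record2 = 1, Record3 = 1
)
DF2 <- data.frame(
  ID = 1:10,
  PatientID = c(7316, 7302, 7341, 7340, 7313, 7366, 7362, 7361, 7360, 7363),
  Blood = c("06668", "08677", "07787", "08977", "07887",
            "56668", "88677", "77787", "98977", "87887"),
  SomeRecord = "21/08/2015",
  Foo = c(1, 3, 2, 1, 1, 1, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Filter: rows of DF2 whose PatientID appears in DF1, in DF2's original order
DF2_filtered <- DF2[DF2$PatientID %in% DF1$PatientID, ]

# Hypothetical case: patient 7366 appears twice in the small table
DF1_dup <- rbind(DF1, DF1[1, ])
n_merge <- nrow(merge(DF2, DF1_dup[-1], by = "PatientID"))   # 6: the 7366 row is repeated
n_in    <- nrow(DF2[DF2$PatientID %in% DF1_dup$PatientID, ]) # 5: each DF2 row at most once
```

So %in% behaves as a pure filter, while merge() returns one row per matching pair.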

2) sqldf

> library(sqldf)
> sqldf("select b.* from DF1 a join DF2 b using (PatientID)")

  ID PatientID Blood SomeRecord Foo
1  6      7366 56668 21/08/2015   1
2  7      7362 88677 21/08/2015   3
3  8      7361 77787 21/08/2015   2
4  9      7360 98977 21/08/2015   1
5 10      7363 87887 21/08/2015   1

> sqldf("select b.*, a.* from DF1 a join DF2 b using (PatientID)")

  ID PatientID Blood SomeRecord Foo ID PatientID Record1 Record2 Record3
1  6      7366 56668 21/08/2015   1  1      7366       3       1       1
2  7      7362 88677 21/08/2015   3  2      7362       3       1       1
3  8      7361 77787 21/08/2015   2  3      7361       3       1       1
4  9      7360 98977 21/08/2015   1  4      7360       3       1       1
5 10      7363 87887 21/08/2015   1  5      7363       3       1       1

Note: the input is:

Lines1 <- "ID  |  PatientID  | Record1 |  Record2 |  Record3
1   |  7366       |  3      |  1      |     1
2   |  7362       |  3      |  1      |     1
3   |  7361       |  3      |  1      |     1
4   |  7360       |  3      |  1      |     1
5   |  7363       |  3      |  1      |     1"

Lines2 <- " ID  |  PatientID  |  Blood      | SomeRecord |  Foo
    1   |  7316       |  06668      | 21/08/2015 |     1
    2   |  7302       |  08677      | 21/08/2015 |     3
    3   |  7341       |  07787      | 21/08/2015 |     2
    4   |  7340       |  08977      | 21/08/2015 |     1
    5   |  7313       |  07887      | 21/08/2015 |     1
    6   |  7366       |  56668      | 21/08/2015 |     1
    7   |  7362       |  88677      | 21/08/2015 |     3
    8   |  7361       |  77787      | 21/08/2015 |     2
    9   |  7360       |  98977      | 21/08/2015 |     1
    10  |  7363       |  87887      | 21/08/2015 |     1"

DF1 <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
DF2 <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)