Question

I have two data.tables in R, like so:

first_table

id | first | trunc | val1
=========================
 1 |   Bob | Smith |   10
 2 |   Sue | Goldm |   20
 3 |   Sue | Wollw |   30
 4 |   Bob | Bellb |   40

second_table

id | first |       last | val2
==============================
 1 |   Bob |      Smith |    A
 2 |   Bob |      Smith |    B
 3 |   Sue |    Goldman |    A
 4 |   Sue |    Goldman |    B
 5 |   Sue |  Wollworth |    A
 6 |   Sue |  Wollworth |    B
 7 |   Bob | Bellbottom |    A
 8 |   Bob | Bellbottom |    B

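As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table but not in the second. I want to "join" on the combination of first name and last name, under these incredibly naive assumptions:

first, last uniquely defines a person
truncating the last name does not introduce ambiguity.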

The result should look like this:

id | first | trunc |       last | val1 
=======================================
 1 |   Bob | Smith |      Smith |   10
 2 |   Sue | Goldm |    Goldman |   20
 3 |   Sue | Wollw |  Wollworth |   30
 4 |   Bob | Bellb | Bellbottom |   40

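Basically, for each row in first_table, I need to find a row that back-fills the full last name.

For each row in first_table: find the first row in second_table where first matches and trunc is a substring of last, then join on that row.

Is there an easy vectorized way to do this with data.table?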

Answer 1

One approach is to join on first, then filter based on the substring match:

first_table[
    unique(second_table[, .(first, last)])
    , on = "first"
    , nomatch = 0
][
    substr(last, 1, nchar(trunc)) == trunc
]

#    id first trunc val1       last
# 1:  1   Bob Smith   10      Smith
# 2:  2   Sue Goldm   20    Goldman
# 3:  3   Sue Wollw   30  Wollworth
# 4:  4   Bob Bellb   40 Bellbottom
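
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same join-then-filter idea. The table construction is reconstructed from the question's printed tables (column types are an assumption), the intermediate names people and candidates are hypothetical, and base R's startsWith() is swapped in for the substr() comparison. allow.cartesian = TRUE is added only as a precaution, since joining on first alone can multiply rows on larger data.

```r
library(data.table)

# Tables rebuilt from the question's printed output (types are assumed).
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  id    = 1:8,
  first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
  last  = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
  val2  = rep(c("A", "B"), times = 4)
)

# One row per distinct (first, last) pair in second_table.
people <- unique(second_table[, .(first, last)])

# Step 1: pair every first_table row with every person sharing its first name.
candidates <- first_table[people, on = "first", nomatch = 0,
                          allow.cartesian = TRUE]

# Step 2: keep only pairs where the full last name starts with trunc.
result <- candidates[startsWith(last, trunc)]
print(result)
```

Because step 1 builds every same-first-name pair before filtering, the intermediate table can get large when many people share a first name; the two-column join in the alternative avoids that.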

Alternatively, truncate last in second_table to match the first table, then join on both columns:

first_table[
    unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
    , on = c("first", "trunc")
    , nomatch = 0
]
## yields the same answer
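
The hard-coded 5 above relies on every trunc value being exactly five characters. A hedged variant, assuming the truncation width is uniform and can be read off first_table, derives the width from the data and also checks the question's "no ambiguity" assumption before joining. The names width, lookup and ambiguous are hypothetical, and the tables are rebuilt from the question's printed output.

```r
library(data.table)

# Tables rebuilt from the question's printed output (types are assumed).
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  id    = 1:8,
  first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
  last  = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
  val2  = rep(c("A", "B"), times = 4)
)

# Assumption: all last names were cut to the same width.
width <- max(nchar(first_table$trunc))

# Distinct people, with last truncated the same way as in first_table.
lookup <- unique(second_table[, .(first, last,
                                  trunc = substr(last, 1, width))])

# Any (first, trunc) key that maps to more than one full last name
# would violate the "truncation is unambiguous" assumption.
ambiguous <- lookup[, .N, by = .(first, trunc)][N > 1]
stopifnot(nrow(ambiguous) == 0)

# Equi-join on both columns.
result <- first_table[lookup, on = c("first", "trunc"), nomatch = 0]
print(result)
```

If some last names are shorter than the width (for example "Lee"), substr() simply returns the whole name, so those rows still match as long as first_table stores them untruncated.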
