Fuzzy outer join/merge in R

Date: 2018-08-21 23:47:01

Tags: r data.table outer-join fuzzy-logic fuzzyjoin

I have two datasets and want to do a fuzzy join.
Here are the two datasets.


library(data.table)

# data1
dt1 <- fread("NAME,State,type
ABERCOMBIE TOWNSHIP,ND,TS
ABERDEEN TOWNSHIP,NJ,TS
ABERDEEN TOWNSHIP,SD,TS
ABBOTSFORD CITY,WI,CI
ABERDEEN CITY,WA,CI
ADA TOWNSHIP,MI,TS
ADAMS,IL,TS", header = TRUE)

# data2
dt2 <- fread("NAME,State,type
ABERDEEN TWP N J,NJ,TS
ABERDEEN WASH,WA,CI
ABBOTSFORD WIS,WI,CI
ADA TWP MICH,MI,TS
ADA OHIO,OH,CI
ADAMS MASS,MA,CI
ADAMSVILLE ALA,AL,CI", header = TRUE)

The two datasets have the same values in the State and type columns; however, the NAME column differs. The names are similar but not identical.

I could take the first 3 or 4 characters of NAME in each dataset and merge on that, but with a large number of observations the share of correct matches would probably not be high.

dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = TRUE)

This approach is not good.

I looked at the fuzzyjoin package, but I am not sure whether I am using it correctly.

The result for this example is correct:

library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))

# Results
                 NAME.x State.x type.x           NAME.y State.y type.y
 1:   ABERDEEN TOWNSHIP      NJ     TS ABERDEEN TWP N J      NJ     TS
 2:     ABBOTSFORD CITY      WI     CI   ABBOTSFORD WIS      WI     CI
 3:       ABERDEEN CITY      WA     CI    ABERDEEN WASH      WA     CI
 4:        ADA TOWNSHIP      MI     TS     ADA TWP MICH      MI     TS
 5: ABERCOMBIE TOWNSHIP      ND     TS             <NA>    <NA>   <NA>
 6:   ABERDEEN TOWNSHIP      SD     TS             <NA>    <NA>   <NA>
 7:               ADAMS      IL     TS             <NA>    <NA>   <NA>
 8:                <NA>    <NA>   <NA>         ADA OHIO      OH     CI
 9:                <NA>    <NA>   <NA>       ADAMS MASS      MA     CI
10:                <NA>    <NA>   <NA>   ADAMSVILLE ALA      AL     CI

However, if any NAME is identical in the two datasets, the answer is incorrect.
To show this, I added a new observation, THE SAME, to both datasets.

dt1 <- fread("NAME,State,type
ABERCOMBIE TOWNSHIP,ND,TS
ABERDEEN TOWNSHIP,NJ,TS
ABERDEEN TOWNSHIP,SD,TS
ABBOTSFORD CITY,WI,CI
ABERDEEN CITY,WA,CI
ADA TOWNSHIP,MI,TS
ADAMS,IL,TS
THE SAME,AA,BB", header = TRUE)

dt2 <- fread("NAME,State,type
ABERDEEN TWP N J,NJ,TS
ABERDEEN WASH,WA,CI
ABBOTSFORD WIS,WI,CI
ADA TWP MICH,MI,TS
ADA OHIO,OH,CI
ADAMS MASS,MA,CI
ADAMSVILLE ALA,AL,CI
THE SAME,AA,BB", header = TRUE)

fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))

# Results
                 NAME.x State.x type.x           NAME.y State.y type.y
 1:   ABERDEEN TOWNSHIP      NJ     TS ABERDEEN TWP N J      NJ     TS
 2:     ABBOTSFORD CITY      WI     CI   ABBOTSFORD WIS      WI     CI
 3:       ABERDEEN CITY      WA     CI    ABERDEEN WASH      WA     CI
 4:        ADA TOWNSHIP      MI     TS     ADA TWP MICH      MI     TS
 5: ABERCOMBIE TOWNSHIP      ND     TS             <NA>    <NA>   <NA>
 6:   ABERDEEN TOWNSHIP      SD     TS             <NA>    <NA>   <NA>
 7:               ADAMS      IL     TS             <NA>    <NA>   <NA>
 8:            THE SAME      AA     BB             <NA>    <NA>   <NA>
 9:                <NA>    <NA>   <NA>         ADA OHIO      OH     CI
10:                <NA>    <NA>   <NA>       ADAMS MASS      MA     CI
11:                <NA>    <NA>   <NA>   ADAMSVILLE ALA      AL     CI
12:                <NA>    <NA>   <NA>         THE SAME      AA     BB

This is not the correct result: THE SAME shows up as two unmatched rows (8 and 12) instead of one matched row. Any suggestions?

It seems I cannot use the join this way.

1 answer:

Answer 0: (score: 0)

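This happens because you are asking fuzzy_full_join for pairs whose NAME does not match (via `!=`) while State and type do match (via `==`, `==`). So if all three columns match, the pair does not show up.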
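To make the point above concrete, here is a minimal base R sketch of the matching rule as described: a pair of rows is kept only when every comparison in `match_fun` returns TRUE. The helper `pair_matches` is hypothetical and written only for illustration; it is a simplified model, not fuzzyjoin's actual implementation.

```r
# Hypothetical helper: apply one comparison function per column and
# keep the pair only if all comparisons are TRUE.
pair_matches <- function(x, y, match_funs) {
  checks <- mapply(function(f, a, b) f(a, b), match_funs, x, y)
  all(checks)
}

# The row that was added to both datasets in the question.
x <- list(NAME = "THE SAME", State = "AA", type = "BB")
y <- list(NAME = "THE SAME", State = "AA", type = "BB")

# With `!=` on NAME, identical names fail the first comparison.
not_equal_name <- pair_matches(x, y, list(`!=`, `==`, `==`))

# With `==` on NAME, the identical pair passes all three comparisons.
equal_name <- pair_matches(x, y, list(`==`, `==`, `==`))

print(c(not_equal_name = not_equal_name, equal_name = equal_name))
```

Under this model the identical pair is rejected by the first specification and accepted by the second, which is why THE SAME appears unmatched on both sides in the question's output.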

You could run it twice, once with each of the following:

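fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"),
                match_fun = list(`!=`, `==`, `==`))
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"),
                match_fun = list(`==`, `==`, `==`))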

Created on 2019-03-17 by the reprex package (v0.2.1)
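Running the join twice still leaves the two results to be combined, and the unmatched rows from one run can overlap with matched rows from the other. If NAME is allowed to be either equal or different, it no longer restricts the match at all, so one possible alternative is an ordinary full join on State and type only. The sketch below swaps in base R `merge` instead of fuzzyjoin and uses plain data frames read with `read.csv`; it assumes, as in the sample data, that each State/type combination occurs at most once per dataset (otherwise every combination of duplicates would be paired).

```r
# Sample data from the question, including the added THE SAME row.
dt1 <- read.csv(text = "NAME,State,type
ABERCOMBIE TOWNSHIP,ND,TS
ABERDEEN TOWNSHIP,NJ,TS
ABERDEEN TOWNSHIP,SD,TS
ABBOTSFORD CITY,WI,CI
ABERDEEN CITY,WA,CI
ADA TOWNSHIP,MI,TS
ADAMS,IL,TS
THE SAME,AA,BB", stringsAsFactors = FALSE)

dt2 <- read.csv(text = "NAME,State,type
ABERDEEN TWP N J,NJ,TS
ABERDEEN WASH,WA,CI
ABBOTSFORD WIS,WI,CI
ADA TWP MICH,MI,TS
ADA OHIO,OH,CI
ADAMS MASS,MA,CI
ADAMSVILLE ALA,AL,CI
THE SAME,AA,BB", stringsAsFactors = FALSE)

# Full outer join on State and type; NAME is carried along from both
# sides as NAME.x and NAME.y without taking part in the match.
res <- merge(dt1, dt2, by = c("State", "type"), all = TRUE)

# Rows where both sides found a partner.
matched <- res[!is.na(res$NAME.x) & !is.na(res$NAME.y), ]
print(res)
```

With this data, `matched` should contain the four pairs from the question's first result plus the THE SAME pair, and the remaining rows of `res` are the observations that exist in only one dataset.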