Question

I need to merge two data frames. The first one looks like this:

> df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh", "Alexandre", "Alexandre"), Location = c("a","a","a","b","c"), time = c(1,2,1,1,1))
> df1
         Artist Location time
1  Vincent van         a    1
2  Vincent van         a    2
3 Theo van Gogh        a    1
4     Alexandre        b    1
5     Alexandre        c    1

The second one:

> df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh", "Alexandre Dumas", "Alexandre Dumas"), HomeNumber = c(123,234,456,789), Location = c( "a","a","b","c"))
> df2
            Artist HomeNumber Location
1 Vincent van Gogh        123        a
2    Theo van Gogh        234        a
3  Alexandre Dumas        456        b
4  Alexandre Dumas        789        c

I want this data frame:

> df3 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh", "Alexandre", "Alexandre"), Location = c("a","a","a","b","c"), time = c(1,2,1,1,1), HomeNumber = c(123,123,234,456,789))
> df3
         Artist Location time HomeNumber
1  Vincent van         a    1        123
2  Vincent van         a    2        123
3 Theo van Gogh        a    1        234
4     Alexandre        b    1        456
5     Alexandre        c    1        789
>

The merge only works for Theo:

> df3 <- merge(df1, df2, by.x = "Artist", by.y = "Artist", all.x = TRUE)
> df3
         Artist Location.x time HomeNumber Location.y
1     Alexandre          b    1         NA       <NA>
2     Alexandre          c    1         NA       <NA>
3 Theo van Gogh          a    1        234          a
4  Vincent van           a    1         NA       <NA>
5  Vincent van           a    2         NA       <NA>

There are two reasons: (a) Vincent is missing part of his last name in df1. (b) Alexandre is the name of both Alexandre Dumas senior and Alexandre Dumas junior.

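I can solve (a) with df1$Artist <- gsub("Vincent van $","Vincent van Gogh", df1$Artist), but my real data is very large, and before running gsub I would first have to find out Vincent's full name. One possible solution is to run grep("Vincent van "... against df2 and build a function that, if the resulting vector has length 1, uses gsub to replace the name in df1 with the df2$Artist value returned. I don't know how to do that.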

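(b) is trickier for me. One solution I can think of (but cannot code) is to first use an if condition to select Alexandre by location, and then apply solution (a) to gsub the name.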

I think solving (a) and (b) would give me the df3 I want. Does anyone have an idea how to merge these data frames efficiently? Thanks!

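Edit: Note that Alexandre Dumas is actually two different units, so merging the two data frames should bring in the matching HomeNumber and Location for each. Vincent van Gogh is a single unit with two observation times.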

Answer 1

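The result you want is undermined by the fact that each data frame has two rows that must be treated as having the same id, namely the Alexandre rows. The join will turn these into a 2 x 2 match: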

df2$short <- substr(df2$Artist, 1,7)
df1$short <- substr(df1$Artist, 1,7)
(dfmer <- merge(df1, df2, by="short") )
#-----
    short      Artist.x Location.x time         Artist.y HomeNumber Location.y
1 Alexand     Alexandre          b    1  Alexandre Dumas        456          b
2 Alexand     Alexandre          b    1  Alexandre Dumas        789          c
3 Alexand     Alexandre          c    1  Alexandre Dumas        456          b
4 Alexand     Alexandre          c    1  Alexandre Dumas        789          c
5 Theo va Theo van Gogh          a    1    Theo van Gogh        234          a
6 Vincent  Vincent van           a    1 Vincent van Gogh        123          a
7 Vincent  Vincent van           a    2 Vincent van Gogh        123          a
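
To see where the 7 rows come from, here is a minimal sketch that counts rows per key in each data frame and multiplies the counts for shared keys. It rebuilds the example data; the names n1, n2, shared and expected_rows are made up for this illustration, and stringsAsFactors = FALSE is only there so the code behaves the same on older R versions.

```r
# Rebuild the example data and the 7-character key used above.
df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh",
                             "Alexandre", "Alexandre"),
                  Location = c("a", "a", "a", "b", "c"),
                  time = c(1, 2, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh",
                             "Alexandre Dumas", "Alexandre Dumas"),
                  HomeNumber = c(123, 234, 456, 789),
                  Location = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df1$short <- substr(df1$Artist, 1, 7)
df2$short <- substr(df2$Artist, 1, 7)

# Number of rows per key in each data frame.
n1 <- table(df1$short)
n2 <- table(df2$short)

# For every key present in both, merge() returns (rows in df1) x (rows in df2).
shared <- intersect(names(n1), names(n2))
expected_rows <- sum(as.vector(n1[shared]) * as.vector(n2[shared]))

merged <- merge(df1, df2, by = "short")
print(expected_rows)
print(nrow(merged))
```

The "Alexand" key contributes 2 x 2 rows, which is why joining on the name alone cannot tell the two Alexandre units apart.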

If you want to pick just the first instance, you can use !duplicated on location and time:

> dfmer[!duplicated( dfmer[, c("Location.x", "time")]), ]
    short      Artist.x Location.x time         Artist.y HomeNumber Location.y
1 Alexand     Alexandre          b    1  Alexandre Dumas        456          b
3 Alexand     Alexandre          c    1  Alexandre Dumas        456          b
5 Theo va Theo van Gogh          a    1    Theo van Gogh        234          a
7 Vincent  Vincent van           a    2 Vincent van Gogh        123          a

In response to the follow-up concern (the need to add Location as a linking variable had not been stated before):

> (dfmer <- merge(df1, df2, by=c("short", "Location") ) )
    short Location      Artist.x time         Artist.y HomeNumber
1 Alexand        b     Alexandre    1  Alexandre Dumas        456
2 Alexand        c     Alexandre    1  Alexandre Dumas        789
3 Theo va        a Theo van Gogh    1    Theo van Gogh        234
4 Vincent        a  Vincent van     1 Vincent van Gogh        123
5 Vincent        a  Vincent van     2 Vincent van Gogh        123
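
The fixed 7-character key works for this example, but the cut-off is specific to these names. As a possible alternative, here is a rough sketch of the grep-style lookup described in the question: it treats each df1 name as a prefix of a df2 name and also requires the same Location. The helper match_one and the vector idx are hypothetical names for this illustration, and the sketch assumes every misspelling is a truncation of the full name.

```r
df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh",
                             "Alexandre", "Alexandre"),
                  Location = c("a", "a", "a", "b", "c"),
                  time = c(1, 2, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh",
                             "Alexandre Dumas", "Alexandre Dumas"),
                  HomeNumber = c(123, 234, 456, 789),
                  Location = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: index of the single df2 row whose Artist starts with
# `name` and whose Location equals `loc`; NA if there is no unique match.
match_one <- function(name, loc) {
  hits <- which(startsWith(df2$Artist, name) & df2$Location == loc)
  if (length(hits) == 1L) hits else NA_integer_
}

# Look up one df2 row per df1 row, then attach the matching HomeNumber.
idx <- mapply(match_one, df1$Artist, df1$Location, USE.NAMES = FALSE)
df3 <- cbind(df1, HomeNumber = df2$HomeNumber[idx])
print(df3)
```

Rows without a unique match get NA instead of a guessed value, so ambiguous names can be inspected by hand. The row-by-row lookup may be slow on very large data; the merge on a derived key shown above scales better when a reliable key exists.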

Merging data frames with predictable misspellings
