I need to merge two data frames. The first looks like this:
> df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh", "Alexandre", "Alexandre"), Location = c("a","a","a","b","c"), time = c(1,2,1,1,1))
> df1
         Artist Location time
1  Vincent van         a    1
2  Vincent van         a    2
3 Theo van Gogh        a    1
4     Alexandre        b    1
5     Alexandre        c    1
And the second:
> df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh", "Alexandre Dumas", "Alexandre Dumas"), HomeNumber = c(123,234,456,789), Location = c( "a","a","b","c"))
> df2
            Artist HomeNumber Location
1 Vincent van Gogh        123        a
2    Theo van Gogh        234        a
3  Alexandre Dumas        456        b
4  Alexandre Dumas        789        c
I want this data frame:
> df3 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh", "Alexandre", "Alexandre"), Location = c("a","a","a","b","c"), time = c(1,2,1,1,1), HomeNumber = c(123,123,234,456,789))
> df3
         Artist Location time HomeNumber
1  Vincent van         a    1        123
2  Vincent van         a    2        123
3 Theo van Gogh        a    1        234
4     Alexandre        b    1        456
5     Alexandre        c    1        789
A plain merge only works for Theo:
> df3 <- merge(df1, df2, by.x = "Artist", by.y = "Artist", all.x =TRUE)
> df3
         Artist Location.x time HomeNumber Location.y
1     Alexandre          b    1         NA       <NA>
2     Alexandre          c    1         NA       <NA>
3 Theo van Gogh          a    1        234          a
4  Vincent van           a    1         NA       <NA>
5  Vincent van           a    2         NA       <NA>
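There are two reasons:
(a) Vincent is missing part of his last name in df1.
(b) Alexandre is the first name of both Alexandre Dumas Sr. and Alexandre Dumas Jr.
I can use df1$Artist <- gsub("Vincent van $","Vincent van Gogh", df1$Artist) to solve (a), but my data is actually very large, and I would first have to know Vincent's full name before running gsub. One possible solution is to use grep("Vincent van "... on df2 and build a function that, if the resulting vector has length 1, uses gsub to copy the returned df2$Artist into df1. I don't know how to do that.
(b) is a bit trickier for me. One solution I can think of (but can't code) is to first use an if statement to select the Alexandre from one location, then apply solution (a) to gsub the name.
I think solving (a) and (b) would return the df3 I want. Do you have any idea how to merge these data frames efficiently? Thanks!
Edit: Note that Alexandre is actually two different units, so the merge should bring in the HomeNumber that corresponds to each Location. Vincent is one unit, but with two time observations.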
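Answer 0 (score: 2)
The result you want is spoiled by the fact that each data frame has two rows that would be treated as having the same id, namely the Alexandre rows. The join will turn those into a 2 x 2 match: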
df2$short <- substr(df2$Artist, 1,7)
df1$short <- substr(df1$Artist, 1,7)
(dfmer <- merge(df1, df2, by="short") )
#-----
    short      Artist.x Location.x time         Artist.y HomeNumber Location.y
1 Alexand     Alexandre          b    1  Alexandre Dumas        456          b
2 Alexand     Alexandre          b    1  Alexandre Dumas        789          c
3 Alexand     Alexandre          c    1  Alexandre Dumas        456          b
4 Alexand     Alexandre          c    1  Alexandre Dumas        789          c
5 Theo va Theo van Gogh          a    1    Theo van Gogh        234          a
6 Vincent  Vincent van           a    1 Vincent van Gogh        123          a
7 Vincent  Vincent van           a    2 Vincent van Gogh        123          a
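One way to anticipate this kind of many-to-many match before merging is to count how often each key occurs on each side. The following is only a minimal sketch that rebuilds the question's df1 and df2; the names n1, n2, shared and expected_rows are illustrative and not part of the original answer.

```r
# Rebuild the sample data from the question (strings kept as character)
df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh",
                             "Alexandre", "Alexandre"),
                  Location = c("a", "a", "a", "b", "c"),
                  time = c(1, 2, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh",
                             "Alexandre Dumas", "Alexandre Dumas"),
                  HomeNumber = c(123, 234, 456, 789),
                  Location = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df1$short <- substr(df1$Artist, 1, 7)
df2$short <- substr(df2$Artist, 1, 7)

# Count rows per key on each side
n1 <- table(df1$short)
n2 <- table(df2$short)

# An inner join on short yields n1 * n2 rows for every shared key
shared <- intersect(names(n1), names(n2))
expected_rows <- as.integer(n1[shared]) * as.integer(n2[shared])
names(expected_rows) <- shared

expected_rows       # rows contributed by each key
sum(expected_rows)  # total rows the merge on short will return
```

Any key whose product is larger than its count in df1 signals that the merge will multiply rows for that key.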
If you want to pick only the first instance, you can use !duplicated on location and time:
> dfmer[!duplicated( dfmer[, c("Location.x", "time")]), ]
    short      Artist.x Location.x time         Artist.y HomeNumber Location.y
1 Alexand     Alexandre          b    1  Alexandre Dumas        456          b
3 Alexand     Alexandre          c    1  Alexandre Dumas        456          b
5 Theo va Theo van Gogh          a    1    Theo van Gogh        234          a
7 Vincent  Vincent van           a    2 Vincent van Gogh        123          a
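Deduplicating on Location.x and time alone treats rows from different artists as duplicates whenever they share a location and a time, which is why row 6 (Vincent, a, 1) is missing from the output above: Theo's row 5 has the same pair. A possible variant, offered as a sketch rather than as the answer's own method, adds the short column to the key. The name dedup is illustrative.

```r
# Rebuild the sample data from the question (strings kept as character)
df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh",
                             "Alexandre", "Alexandre"),
                  Location = c("a", "a", "a", "b", "c"),
                  time = c(1, 2, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh",
                             "Alexandre Dumas", "Alexandre Dumas"),
                  HomeNumber = c(123, 234, 456, 789),
                  Location = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df1$short <- substr(df1$Artist, 1, 7)
df2$short <- substr(df2$Artist, 1, 7)

dfmer <- merge(df1, df2, by = "short")

# Keep the first merged row for each (artist key, location, time) combination
dedup <- dfmer[!duplicated(dfmer[, c("short", "Location.x", "time")]), ]
dedup
```

This keeps both of Vincent's observations, but it still takes whichever matching df2 row comes first, so an Alexandre row can end up with the HomeNumber of the other location.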
In response to the concern (not raised earlier) about needing to add Location as a linking variable:
> (dfmer <- merge(df1, df2, by=c("short", "Location") ) )
    short Location      Artist.x time         Artist.y HomeNumber
1 Alexand        b     Alexandre    1  Alexandre Dumas        456
2 Alexand        c     Alexandre    1  Alexandre Dumas        789
3 Theo va        a Theo van Gogh    1    Theo van Gogh        234
4 Vincent        a  Vincent van     1 Vincent van Gogh        123
5 Vincent        a  Vincent van     2 Vincent van Gogh        123
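The fixed 7-character key works for this sample, but on larger data two different artists could share their first seven characters. The idea sketched in the question (look the truncated name up in df2 and accept it only when exactly one candidate is found) can be combined with Location roughly as follows. This is a hedged sketch, not the answer's method: the helper name match_prefix is hypothetical, and rows with zero or several candidates are assumed to get NA.

```r
# Rebuild the sample data from the question (strings kept as character)
df1 <- data.frame(Artist = c("Vincent van ", "Vincent van ", "Theo van Gogh",
                             "Alexandre", "Alexandre"),
                  Location = c("a", "a", "a", "b", "c"),
                  time = c(1, 2, 1, 1, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Artist = c("Vincent van Gogh", "Theo van Gogh",
                             "Alexandre Dumas", "Alexandre Dumas"),
                  HomeNumber = c(123, 234, 456, 789),
                  Location = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: index of the single df2 row whose Location equals loc
# and whose Artist starts with the whitespace-trimmed name;
# NA if there is no such row or more than one
match_prefix <- function(name, loc) {
  hit <- which(df2$Location == loc & startsWith(df2$Artist, trimws(name)))
  if (length(hit) == 1) hit else NA_integer_
}

# Look up one df2 row per df1 row, then attach its HomeNumber
idx <- mapply(match_prefix, df1$Artist, df1$Location, USE.NAMES = FALSE)
df3 <- cbind(df1, HomeNumber = df2$HomeNumber[idx])
df3
```

Each df1 row is compared against every df2 row, so this may be slow on very large data; merging on a derived key plus Location, as in the answer, scales better when a safe key length is known.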