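I have two data frames. The first, df1, holds pairs of co-occurring strings in names1 and names2, together with their frequency. The second, df2, has two columns, names1 and names2, containing one or more of these pairs, sometimes in a different order. I want to match the frequencies from the first data frame, df1,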
df1 <- tibble(names1 = c('architecture', 'assessment', 'build'),
names2 = c('build', 'data', 'data'),
frequency = c(36,13,720))
# A tibble: 3 x 3
names1 names2 frequency
<chr> <chr> <dbl>
1 architecture build 36
2 assessment data 13
3 build data 720
to the rows of the second data frame, df2.
df2 <- tibble(names1 = c('architecture', 'build', 'assessment','assessment', 'business'),
names2 = c('build','architecture', 'data', 'data', 'strategy'))
names1 names2
<chr> <chr>
1 architecture build
2 build architecture
3 assessment data
4 assessment data
5 business strategy
to get this result:
names1 names2 frequency
<chr> <chr> <dbl>
1 architecture build 36
2 build architecture 36
3 assessment data 13
4 assessment data 13
5 business strategy 0
Note: sometimes the pair matches in the same order, df1$names1 == df2$names1 && df1$names2 == df2$names2,
and sometimes in swapped order, df1$names1 == df2$names2 && df1$names2 == df2$names1:
1 architecture build 36
2 build architecture 36
Note: I want to keep the rows that have no match, with a frequency of 0:
5 business strategy 0
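Answer 0 (score: 2)
The problem here is that the order of the name columns matters for the join, so you have to update both datasets to apply a consistent order before joining.
Here is a dplyr solution: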
library(dplyr)
df1 <- tibble(names1 = c('architecture', 'assessment', 'build'),
names2 = c('build', 'data', 'data'),
frequency = c(36,13,720))
df2 <- tibble(names1 = c('architecture', 'build', 'assessment','assessment', 'business'),
names2 = c('build','architecture', 'data', 'data', 'strategy'))
# update df1
df1 = df1 %>%
rowwise() %>%
mutate(names = paste0(sort(c(names1, names2)), collapse = "_")) %>%
select(names, frequency)
# update df2
df2 = df2 %>%
rowwise() %>%
mutate(names = paste0(sort(c(names1, names2)), collapse = "_"))
# join datasets and update columns
left_join(df2, df1, by="names") %>%
mutate(frequency = coalesce(frequency, 0)) %>%
select(-names) %>%
ungroup()
# names1 names2 frequency
# <chr> <chr> <dbl>
# 1 architecture build 36
# 2 build architecture 36
# 3 assessment data 13
# 4 assessment data 13
# 5 business strategy 0
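The row-wise sort above can also be written in a vectorized way. As a minimal sketch, assuming the same df1 and df2 as in the question, the base R functions pmin() and pmax() pick the alphabetically smaller and larger name of each pair, so both orders of a pair get the same key. The helper add_keys and the columns key1 and key2 are names made up for this illustration.

```r
library(dplyr)

df1 <- tibble(names1 = c('architecture', 'assessment', 'build'),
              names2 = c('build', 'data', 'data'),
              frequency = c(36, 13, 720))
df2 <- tibble(names1 = c('architecture', 'build', 'assessment', 'assessment', 'business'),
              names2 = c('build', 'architecture', 'data', 'data', 'strategy'))

# Hypothetical helper: key1/key2 hold the smaller and larger name of each
# pair, so (a, b) and (b, a) end up with identical keys.
add_keys <- function(df) {
  mutate(df, key1 = pmin(names1, names2), key2 = pmax(names1, names2))
}

# Lookup table: one row per unordered pair with its frequency.
lookup <- df1 %>%
  add_keys() %>%
  select(key1, key2, frequency)

# Join on the order-independent keys, fill unmatched rows with 0,
# and keep df2's original columns and row order.
result <- df2 %>%
  add_keys() %>%
  left_join(lookup, by = c("key1", "key2")) %>%
  mutate(frequency = coalesce(frequency, 0)) %>%
  select(names1, names2, frequency)

result
```

Keeping the key as two columns rather than one pasted string also avoids ambiguity if a name ever contains the "_" separator. This sketch assumes df1 lists each unordered pair only once; duplicated pairs would duplicate rows in the join.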
Answer 1 (score: 2)
A solution that chains two dplyr::left_join calls, one for each order of the pair:
require(dplyr)
require(tidyr)
left_join(df2,df1,by=c("names1","names2")) %>%
left_join(df1,by=c(names1="names2",names2="names1")) %>%
mutate(frequency=coalesce(frequency.x,frequency.y,0)) %>%
select(-frequency.x,-frequency.y)
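This solution preserves the order of the columns in df2. The mutate and select lines are needed because the second left_join leaves two frequency columns, frequency.x and frequency.y. These have to be merged back into a single frequency column (with NA replaced by 0), after which the two intermediate columns are dropped.
Result: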
# A tibble: 5 x 3
names1 names2 frequency
<chr> <chr> <dbl>
1 architecture build 36
2 build architecture 36
3 assessment data 13
4 assessment data 13
5 business strategy 0
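The same idea can be expressed with a single join. As a rough sketch, assuming the df1 and df2 from the question, df1 is stacked with a copy of itself in which names1 and names2 are swapped, so every pair is present in both orders before joining. The name both_orders is made up for this illustration.

```r
library(dplyr)

df1 <- tibble(names1 = c('architecture', 'assessment', 'build'),
              names2 = c('build', 'data', 'data'),
              frequency = c(36, 13, 720))
df2 <- tibble(names1 = c('architecture', 'build', 'assessment', 'assessment', 'business'),
              names2 = c('build', 'architecture', 'data', 'data', 'strategy'))

# Hypothetical lookup table: df1 plus a copy with the name columns swapped.
# distinct() removes the duplicate that would arise if names1 == names2.
both_orders <- bind_rows(
  df1,
  tibble(names1 = df1$names2, names2 = df1$names1, frequency = df1$frequency)
) %>%
  distinct()

# One join is now enough; unmatched rows get a frequency of 0.
result <- df2 %>%
  left_join(both_orders, by = c("names1", "names2")) %>%
  mutate(frequency = coalesce(frequency, 0))

result
```

With only one frequency column in the lookup table, no frequency.x or frequency.y columns appear, so the final select step is not needed.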