I have a data frame of bilateral relations between countries:
C1 C2
US FR
FR US
US DE
DE US
US RU
US FI
RU FI
FI RU
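The links are directed, and some of them are missing (for example, I have US > RU but not RU > US). I want to identify all the unique pairs and end up with something like this: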
C1 C2 PairID
US FR 1
FR US 1
US DE 2
DE US 2
US RU -
US FI -
RU FI 3
FI RU 3
Any suggestions?
Answer 0: (score: 2)
Here is one option, assuming you also want to number one-way relations such as US > RU:
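library(dplyr)
df %>%
  mutate(relation = paste(pmin(C1, C2), pmax(C1, C2), sep = "-"), # define the relation regardless of direction
         # start a new ID whenever the relation changes from one row to the next
         # (this relies on both directions of a pair sitting on adjacent rows)
         PairID = cumsum(c(1, head(relation, -1) != tail(relation, -1)))) %>%
  select(-relation)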
# output
C1 C2 PairID
1 US FR 1
2 FR US 1
3 US DE 2
4 DE US 2
5 US RU 3
6 US FI 4
7 RU FI 5
8 FI RU 5
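The cumsum() step above compares each row with the one before it, so it seems to rely on the two directions of a pair being adjacent. A minimal sketch of a variant that should not depend on row order is shown here; it swaps cumsum() for match() against the unique relations. The shuffled data frame df2 and the name res are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical test data: the rows from the question, shuffled so that
# the two directions of a pair are no longer next to each other
df2 <- data.frame(
  C1 = c("US", "US", "FR", "RU", "DE", "US", "FI", "US"),
  C2 = c("FR", "DE", "US", "FI", "US", "RU", "RU", "FI"),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  mutate(relation = paste(pmin(C1, C2), pmax(C1, C2), sep = "-"),
         # the ID is the position of the relation among the unique relations,
         # in order of first appearance
         PairID = match(relation, unique(relation))) %>%
  select(-relation)

res
```

On the original, unshuffled df this should number the rows the same way as the cumsum() version, since the relations first appear in the same order.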
# Data: df
df <- structure(list(C1 = c("US", "FR", "US", "DE", "US", "US", "RU",
"FI"), C2 = c("FR", "US", "DE", "US", "RU", "FI", "FI", "RU")), .Names = c("C1",
"C2"), class = "data.frame", row.names = c(NA, -8L))
Answer 1: (score: 1)
We can create a string identifier that captures a given pair of countries regardless of their order:
library(tidyverse)
# Original data (tibble() replaces the deprecated data_frame())
X <- tibble(C1 = c("US", "FR", "US", "DE", "US", "US", "RU", "FI"),
            C2 = c("FR", "US", "DE", "US", "RU", "FI", "FI", "RU"))
# Creates an order-independent string ID for each entry
Y <- X %>% mutate(S = map2_chr(C1, C2, ~str_flatten(sort(c(.x, .y)))))
# # A tibble: 8 x 3
# C1 C2 S
# <chr> <chr> <chr>
# 1 US FR FRUS
# 2 FR US FRUS
# 3 US DE DEUS
# 4 DE US DEUS
# 5 US RU RUUS
# ...
We can then use these string identifiers to find the country pairs that appear in both directions (e.g. US > FR and FR > US). Such pairs will have two matching string IDs.
# Identify string IDs with both orderings and assign an integer ID to each
Z <- Y %>% group_by(S) %>% filter(n() == 2) %>% ungroup() %>%   # Keep groups of size 2
  select(S) %>% distinct() %>% mutate(PairID = row_number())     # Annotate unique values
# # A tibble: 3 x 2
# S PairID
# <chr> <int>
# 1 FRUS 1
# 2 DEUS 2
# 3 FIRU 3
All that remains is to join the new string ID -> integer ID mapping with the original data and replace the NAs with "-":
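left_join(Y, Z, by = "S") %>% select(-S) %>%
  # cast to character first: recent tidyr versions refuse to fill an integer column with "-"
  mutate(PairID = replace_na(as.character(PairID), "-"))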
# # A tibble: 8 x 3
# C1 C2 PairID
# <chr> <chr> <chr>
# 1 US FR 1
# 2 FR US 1
# 3 US DE 2
# 4 DE US 2
# 5 US RU -
# 6 US FI -
# 7 RU FI 3
# 8 FI RU 3
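The three steps above (build an order-independent key, keep keys seen twice, join back) can also be sketched as a single pipeline. This is only one possible condensation, not the answer's own code: it uses pmin()/pmax() for the key and add_count() in place of the group/filter/join sequence, and the column name n_dir and the result name out are hypothetical. It also assumes no row is duplicated in the same direction, since a repeated US > FR would otherwise count as a two-way pair.

```r
library(dplyr)

X <- tibble(C1 = c("US", "FR", "US", "DE", "US", "US", "RU", "FI"),
            C2 = c("FR", "US", "DE", "US", "RU", "FI", "FI", "RU"))

out <- X %>%
  # order-independent key for each row
  mutate(S = paste(pmin(C1, C2), pmax(C1, C2))) %>%
  # n_dir = number of rows sharing the key (2 means both directions exist)
  add_count(S, name = "n_dir") %>%
  # number the two-way keys in order of first appearance, mark the rest "-"
  mutate(PairID = if_else(n_dir == 2,
                          as.character(match(S, unique(S[n_dir == 2]))),
                          "-")) %>%
  select(-S, -n_dir)

out
```

PairID is a character column here, as in the output above, because "-" and the integer IDs have to share one type.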