Finding duplicates across two columns

Time: 2019-01-16 11:32:37

Tags: r dplyr

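I have a correlation matrix generated with corr <- cor(data, use = "pairwise.complete.obs"). I used the following code to reshape it into long format and keep only the correlations whose absolute value is greater than 0.1: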

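library(dplyr)   # %>%, filter, mutate, arrange
library(tidyr)   # gather
library(tibble)  # as_tibble

corr %>% 
  as_tibble(rownames = "From") %>%               # row names become the From column
  gather(key = "To", value = "corr", -From) %>%  # one row per (From, To) pair
  filter(From != To) %>%                         # drop the diagonal (self-correlations)
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>% 
  arrange(desc(corr_abs))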

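However, every correlation appears twice in the result, once as (From, To) and once as (To, From). How can I remove these duplicates when the values sit in two different columns?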

Output

# A tibble: 8 x 4
  From            To                corr corr_abs
  <chr>           <chr>            <dbl>    <dbl>
1 health.age      health.employed -0.393    0.393
2 health.employed health.age      -0.393    0.393
3 health.age      health.marital   0.212    0.212
4 health.marital  health.age       0.212    0.212
5 health.alcohol  health.gender    0.187    0.187
6 health.gender   health.alcohol   0.187    0.187
7 health.age      health.fruitveg  0.100    0.100
8 health.fruitveg health.age       0.100    0.100

Expected

# A tibble: 4 x 4
  From            To                corr corr_abs
  <chr>           <chr>            <dbl>    <dbl>
1 health.age      health.employed -0.393    0.393
2 health.age      health.marital   0.212    0.212
3 health.alcohol  health.gender    0.187    0.187
4 health.age      health.fruitveg  0.100    0.100

Data

corr <- structure(c(1, 0.0632225392922264, 0.0554804788901363, 0.0974838182384356, 
0.212473674076218, -0.0286618705621989, 0.0632225392922264, 1, 
0.0908529910265203, -0.0554639294179715, -0.0326865391045356, 
0.186574369192519, 0.0554804788901363, 0.0908529910265203, 1, 
0.0377351030257117, -0.392764651422931, 0.065822234809157, 0.0974838182384356, 
-0.0554639294179715, 0.0377351030257117, 1, 0.10048775378073, 
-0.0684000695994252, 0.212473674076218, -0.0326865391045356, 
-0.392764651422931, 0.10048775378073, 1, -0.0312405196930598, 
-0.0286618705621989, 0.186574369192519, 0.065822234809157, -0.0684000695994252, 
-0.0312405196930598, 1), .Dim = c(6L, 6L), .Dimnames = list(c("health.marital", 
"health.gender", "health.employed", "health.fruitveg", "health.age", 
"health.alcohol"), c("health.marital", "health.gender", "health.employed", 
"health.fruitveg", "health.age", "health.alcohol")))

1 Answer:

Answer 0: (score: 4)

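One option is to replace the upper-triangle values of the original matrix with NA (including the diagonal, via diag = TRUE) and then drop them in gather with na.rm = TRUE. Because a correlation matrix is symmetric, the lower triangle already contains each pair exactly once, and masking the diagonal also makes the separate From/To inequality filter unnecessary: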

corr %>% 
   replace(., upper.tri(., diag = TRUE), NA) %>%
   as_tibble(rownames = "From") %>% 
   gather(key = "To", value = "corr", -From, na.rm = TRUE) %>% 
   mutate(corr_abs = abs(corr)) %>% 
   filter(corr_abs > 0.1) %>% 
   arrange(-corr_abs)
# A tibble: 4 x 4
#  From           To                corr corr_abs
#  <chr>          <chr>            <dbl>    <dbl>
#1 health.age     health.employed -0.393    0.393
#2 health.age     health.marital   0.212    0.212
#3 health.alcohol health.gender    0.187    0.187
#4 health.age     health.fruitveg  0.100    0.100
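
If the data is already in long format and the original matrix is no longer available, another possible approach is to build an order-independent key for each pair and keep one row per key. The following is only a minimal sketch under stated assumptions: the toy matrix m and the helper columns var1 and var2 are hypothetical names introduced for illustration, not part of the question or the answer above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy symmetric correlation matrix (hypothetical values, for illustration only)
m <- matrix(c( 1.0, 0.5,  -0.3,
               0.5, 1.0,   0.05,
              -0.3, 0.05,  1.0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

pairs <- m %>%
  as_tibble(rownames = "From") %>%
  gather(key = "To", value = "corr", -From) %>%
  filter(From != To) %>%
  # pmin/pmax sort the two names within each row, so (a, b) and (b, a)
  # end up with the same (var1, var2) key
  mutate(var1 = pmin(From, To), var2 = pmax(From, To)) %>%
  # keep the first row seen for each unordered pair
  distinct(var1, var2, .keep_all = TRUE) %>%
  select(-var1, -var2) %>%
  mutate(corr_abs = abs(corr)) %>%
  filter(corr_abs > 0.1) %>%
  arrange(desc(corr_abs))

pairs
```

With this approach, which of the two orientations survives depends on the row order going into distinct, so From and To may be swapped relative to the triangle-masking answer. The triangle approach is likely the simpler choice whenever the matrix itself is still at hand.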