Question

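I have a table produced by a right_join in which some columns contain NA values, depending on which source table the entry came from. Each "hit" in the table has an "indx" that starts at 0.

I want to group_by(hit, indx) and apply some conditional filtering, preferably with dplyr.

Here is the data: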

library(dplyr)  # dplyr re-exports tibble()

test <- tibble(hit = c(rep("101mA", 4), rep("1914A", 5)),
               indx = c(0, 0, 0, 1, 0, 0, 0, 0, 1),
               hit_start = c(7, 63, 105, 131, 4, 7, 56, 64, 147),
               hit_end = c(112, 82, 126, 152, 82, 34, 83, 81, 166),
               stamp_score = c(NA, 9.32, 9.30, 9.49, NA, NA, NA, 8.16, 9.15),
               bit_score = c(76.2, NA, NA, NA, 84.7, 8.3, 0.3, NA, NA)
              )

This is the table:

# A tibble: 9 x 6
  hit    indx hit_start hit_end stamp_score bit_score
  <chr> <dbl>     <dbl>   <dbl>       <dbl>     <dbl>
1 101mA     0         7     112       NA         76.2
2 101mA     0        63      82        9.32      NA  
3 101mA     0       105     126        9.30      NA  
4 101mA     1       131     152        9.49      NA  
5 1914A     0         4      82       NA         84.7
6 1914A     0         7      34       NA          8.3
7 1914A     0        56      83       NA          0.3
8 1914A     0        64      81        8.16      NA  
9 1914A     1       147     166        9.15      NA

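Within each group_by(hit, indx) group, if the "stamp_score" column contains even a single NA, I want to keep only the rows with NA entries. However, if a group has no NA in the "stamp_score" column, I want to keep all of its rows.

This is the result I expect: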

# A tibble: 6 x 6
  hit    indx hit_start hit_end stamp_score bit_score
  <chr> <dbl>     <dbl>   <dbl>       <dbl>     <dbl>
1 101mA     0         7     112       NA         76.2
4 101mA     1       131     152        9.49      NA  
5 1914A     0         4      82       NA         84.7
6 1914A     0         7      34       NA          8.3
7 1914A     0        56      83       NA          0.3
9 1914A     1       147     166        9.15      NA

Note that I ultimately intend to use the code on tables with more than 10,000 hits, each with its own "indx".

Answer 1

I'm not sure whether you want to keep the NA values in stamp_score or remove them, but I think this should do the job:

library(dplyr)

# groups (hit, indx) with no missing stamp_score: keep all of their rows
noNA <- test %>% group_by(hit, indx) %>% filter(!any(is.na(stamp_score))) %>% ungroup()

# rows with a missing stamp_score; their groups contain an NA by definition
miss <- test %>% filter(is.na(stamp_score))

# combine the two pieces and restore the group order
test <- bind_rows(noNA, miss) %>% arrange(hit, indx)
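
The rule from the question (keep only the NA rows in any group that contains an NA, otherwise keep the whole group) can also be sketched as a single grouped filter. This is a minimal sketch rather than a definitive solution; the name `result` is introduced here for illustration, and the data is copied from the question so the block runs on its own.

```r
library(dplyr)

# Same sample data as in the question
test <- tibble(
  hit = c(rep("101mA", 4), rep("1914A", 5)),
  indx = c(0, 0, 0, 1, 0, 0, 0, 0, 1),
  hit_start = c(7, 63, 105, 131, 4, 7, 56, 64, 147),
  hit_end = c(112, 82, 126, 152, 82, 34, 83, 81, 166),
  stamp_score = c(NA, 9.32, 9.30, 9.49, NA, NA, NA, 8.16, 9.15),
  bit_score = c(76.2, NA, NA, NA, 84.7, 8.3, 0.3, NA, NA)
)

result <- test %>%
  group_by(hit, indx) %>%
  # keep a row if its own stamp_score is NA,
  # or if its (hit, indx) group contains no NA at all
  filter(is.na(stamp_score) | !anyNA(stamp_score)) %>%
  ungroup()

print(result)
```

Inside a grouped `filter()`, `anyNA(stamp_score)` is evaluated once per group and recycled across that group's rows, so no separate join or bind step is needed.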

Answer 2

Actually, I found the answer in another related question.

It uses a data.table one-liner, which in my case is:

library(data.table)

test <- setDT(test)[, if(any(is.na(stamp_score))) .SD[is.na(stamp_score)] else .SD, .(hit, indx)]

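Essentially, this code subsets a group only when its "stamp_score" column contains an NA.

Thanks to everyone who tried to help, and who also helped me improve my question over time.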
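
For tables with many groups, a variant that avoids evaluating `.SD` once per group might scale better, though that is an assumption worth benchmarking on the real data. The sketch below flags each group first and then filters in one vectorized step; `dt`, `has_na` and `result` are names introduced here for illustration.

```r
library(data.table)

# Same sample data as in the question
dt <- data.table(
  hit = c(rep("101mA", 4), rep("1914A", 5)),
  indx = c(0, 0, 0, 1, 0, 0, 0, 0, 1),
  hit_start = c(7, 63, 105, 131, 4, 7, 56, 64, 147),
  hit_end = c(112, 82, 126, 152, 82, 34, 83, 81, 166),
  stamp_score = c(NA, 9.32, 9.30, 9.49, NA, NA, NA, 8.16, 9.15),
  bit_score = c(76.2, NA, NA, NA, 84.7, 8.3, 0.3, NA, NA)
)

# Flag every (hit, indx) group that has at least one NA stamp_score
dt[, has_na := anyNA(stamp_score), by = .(hit, indx)]

# Keep the NA rows of flagged groups and all rows of unflagged groups
result <- dt[is.na(stamp_score) | !has_na]

# Drop the helper column from the filtered copy
result[, has_na := NULL]

print(result)
```

Note that `dt` keeps the helper column `has_na` after this runs, because `:=` modifies it by reference; only the filtered copy `result` has it removed.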

Filter dplyr groups only when a condition is met, otherwise keep them unchanged
