Question

I have a huge data frame (> 1,000,000 rows).

term    estimate    st.error    statistic    p.value    SNP
(Intercept)    7.68    0.17    44.64    0    rs1406947
GT    0.01    0.01    0.07    0.19    rs1406947     
SEX    1.52    0.14    10.87    0.1    rs1406947 
M    0.12    0.29    0.41    0.67    rs1406947   
N    -0.06    0.12    -0.48    0.63    rs1406947
GT:SEX    -0.03    0.08    -0.44    0.65    rs1406947
GT:N    -0.00    0.06    -0.08    0.93    rs1406947   
(Intercept)    9.23    0.20    34.64    0    rs25904
GT    0.05    0.04    0.12    0.22    rs25904    
SEX    1.67    0.76    10.34    0.1    rs25904 
M    0.14    0.39    0.51    0.55    rs25904   
N    -0.08    0.05    -0.46    0.55    rs25904
GT:SEX    -0.19    0.11    -0.34    0.44    rs25904
GT:N    -0.22    0.33    -0.44    0.55    rs25904           
(Intercept)    7.99    0.66    44.44    0    rs7133579
GT    0.01    0.3    0.04    0.33    rs7133579    
SEX    1.22    0.22    10.44    0.15    rs7133579 
M    0.88    0.22    0.33    0.44    rs7133579   
N    -0.5    0.5    -0.5    0.6    rs7133579
GT:N    -0.00    0.03    -0.04    0.78    rs7133579

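It is made up of blocks of 7 observations: (Intercept), GT, SEX, M, N, GT:SEX and GT:N. However, some blocks are missing one or more observations (for example, the third block is missing GT:SEX). I want to remove these blocks using R. In this toy example, I would get: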

term    estimate    st.error    statistic    p.value    SNP
(Intercept)    7.68    0.17    44.64    0    rs1406947
GT    0.01    0.01    0.07    0.19    rs1406947     
SEX    1.52    0.14    10.87    0.1    rs1406947 
M    0.12    0.29    0.41    0.67    rs1406947   
N    -0.06    0.12    -0.48    0.63    rs1406947
GT:SEX    -0.03    0.08    -0.44    0.65    rs1406947
GT:N    -0.00    0.06    -0.08    0.93    rs1406947   
(Intercept)    9.23    0.20    34.64    0    rs25904
GT    0.05    0.04    0.12    0.22    rs25904    
SEX    1.67    0.76    10.34    0.1    rs25904 
M    0.14    0.39    0.51    0.55    rs25904   
N    -0.08    0.05    -0.46    0.55    rs25904
GT:SEX    -0.19    0.11    -0.34    0.44    rs25904
GT:N    -0.22    0.33    -0.44    0.55    rs25904

Answer 1

Assuming (Intercept) is present in every block, you can test whether each block has length 7.

x[unlist(lapply(split(seq_len(nrow(x)), cumsum(x$term == "(Intercept)")),
                function(y) {if(length(y) == 7) y else NULL})), ]
#          term estimate st.error statistic p.value       SNP
#1  (Intercept)     7.68     0.17     44.64    0.00 rs1406947
#2           GT     0.01     0.01      0.07    0.19 rs1406947
#3          SEX     1.52     0.14     10.87    0.10 rs1406947
#4            M     0.12     0.29      0.41    0.67 rs1406947
#5            N    -0.06     0.12     -0.48    0.63 rs1406947
#6       GT:SEX    -0.03     0.08     -0.44    0.65 rs1406947
#7         GT:N     0.00     0.06     -0.08    0.93 rs1406947
#8  (Intercept)     9.23     0.20     34.64    0.00   rs25904
#9           GT     0.05     0.04      0.12    0.22   rs25904
#10         SEX     1.67     0.76     10.34    0.10   rs25904
#11           M     0.14     0.39      0.51    0.55   rs25904
#12           N    -0.08     0.05     -0.46    0.55   rs25904
#13      GT:SEX    -0.19     0.11     -0.34    0.44   rs25904
#14        GT:N    -0.22     0.33     -0.44    0.55   rs25904

Data:

x <- read.table(header=TRUE, text="term    estimate    st.error    statistic    p.value    SNP
(Intercept)    7.68    0.17    44.64    0    rs1406947
GT    0.01    0.01    0.07    0.19    rs1406947     
SEX    1.52    0.14    10.87    0.1    rs1406947 
M    0.12    0.29    0.41    0.67    rs1406947   
N    -0.06    0.12    -0.48    0.63    rs1406947
GT:SEX    -0.03    0.08    -0.44    0.65    rs1406947
GT:N    -0.00    0.06    -0.08    0.93    rs1406947   
(Intercept)    9.23    0.20    34.64    0    rs25904
GT    0.05    0.04    0.12    0.22    rs25904    
SEX    1.67    0.76    10.34    0.1    rs25904 
M    0.14    0.39    0.51    0.55    rs25904   
N    -0.08    0.05    -0.46    0.55    rs25904
GT:SEX    -0.19    0.11    -0.34    0.44    rs25904
GT:N    -0.22    0.33    -0.44    0.55    rs25904           
(Intercept)    7.99    0.66    44.44    0    rs7133579
GT    0.01    0.3    0.04    0.33    rs7133579    
SEX    1.22    0.22    10.44    0.15    rs7133579 
M    0.88    0.22    0.33    0.44    rs7133579   
N    -0.5    0.5    -0.5    0.6    rs7133579
GT:N    -0.00    0.03    -0.04    0.78    rs7133579")
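For a data frame with over a million rows, the same block-length test can also be written in vectorized form. This is only a minimal sketch, not part of the answer above: it rebuilds just the term and SNP columns of the toy data, and the names block, block_size and kept are introduced here for illustration. It still relies on the assumption that every block starts with (Intercept).

```r
# Toy data: two complete blocks and one block missing GT:SEX
# (only the term and SNP columns are reproduced here).
terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
x <- data.frame(
  term = c(terms, terms, terms[-6]),
  SNP  = rep(c("rs1406947", "rs25904", "rs7133579"), times = c(7, 7, 6)),
  stringsAsFactors = FALSE
)

# Block id: increases by one at every "(Intercept)" row.
block <- cumsum(x$term == "(Intercept)")

# Size of the block each row belongs to, aligned with the rows of x.
block_size <- ave(seq_len(nrow(x)), block, FUN = length)

# Keep only rows that belong to a block of exactly 7 rows.
kept <- x[block_size == 7, ]
```

Because block_size has one entry per row, the final subsetting is a single logical index rather than a list of row numbers collected per block.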

Answer 2

I think you want to group by SNP and check that each block matches what you expect:

library(dplyr)

expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")

df %>%
  group_by(SNP) %>%
  filter(
    all(expected_terms %in% term)
  )

More strictly, if you need to make sure that each term is present exactly once and that no other terms appear:

df %>%
  group_by(SNP) %>%
  filter(
    # use `table` to count occurrence of terms, keep only if all are counted exactly once
    all(table(term)[expected_terms] == 1),
    # keep only if no terms remain after removing your expected set
    length(setdiff(term, expected_terms)) == 0
  )
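If dplyr is not available, the strict check can be sketched in base R as well. This is an illustration under stated assumptions, not the answerer's code: the SNP labels snpA, snpB and snpC and the helper is_complete are made up for this example, and only the term and SNP columns are included.

```r
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")

# Hypothetical toy data with made-up SNP labels.
df <- data.frame(
  term = c(expected_terms,                  # snpA: complete block
           expected_terms[-6],              # snpB: GT:SEX missing
           c(expected_terms[-6], "GT:N")),  # snpC: 7 rows, GT:N twice, no GT:SEX
  SNP  = rep(c("snpA", "snpB", "snpC"), times = c(7, 6, 7)),
  stringsAsFactors = FALSE
)

# TRUE only if the terms are exactly the expected set, each appearing once:
# same number of rows as expected terms, and the same set of values.
is_complete <- function(term) {
  length(term) == length(expected_terms) && setequal(term, expected_terms)
}

# One logical per SNP, named by SNP.
ok <- tapply(df$term, df$SNP, is_complete)

# Keep only rows whose SNP passed the check.
strict <- df[df$SNP %in% names(ok)[ok], ]
```

Note that snpC has 7 rows, so a test on block length alone would keep it. Checking the terms themselves is what rejects it.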

R: Remove blocks of observations from a df if a condition is met
