I have a huge data frame (> 1,000,000 rows).
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
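It is made up of blocks of 7 observations: (Intercept), GT, SEX, M, N, GT:SEX and GT:N. However, several blocks are missing one or more observations (for example, the third block is missing GT:SEX). I want to remove those incomplete blocks using R. For this toy example, I would end up with: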
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
Answer 0 (score: 2)
Assuming every block starts with an (Intercept) row, you can test whether the length of each block is 7.
x[unlist(lapply(split(seq_len(nrow(x)), cumsum(x$term == "(Intercept)")),
function(y) {if(length(y) == 7) y else NULL})), ]
# term estimate st.error statistic p.value SNP
#1 (Intercept) 7.68 0.17 44.64 0.00 rs1406947
#2 GT 0.01 0.01 0.07 0.19 rs1406947
#3 SEX 1.52 0.14 10.87 0.10 rs1406947
#4 M 0.12 0.29 0.41 0.67 rs1406947
#5 N -0.06 0.12 -0.48 0.63 rs1406947
#6 GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
#7 GT:N 0.00 0.06 -0.08 0.93 rs1406947
#8 (Intercept) 9.23 0.20 34.64 0.00 rs25904
#9 GT 0.05 0.04 0.12 0.22 rs25904
#10 SEX 1.67 0.76 10.34 0.10 rs25904
#11 M 0.14 0.39 0.51 0.55 rs25904
#12 N -0.08 0.05 -0.46 0.55 rs25904
#13 GT:SEX -0.19 0.11 -0.34 0.44 rs25904
#14 GT:N -0.22 0.33 -0.44 0.55 rs25904
Data:
x <- read.table(header=TRUE, text="term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579")
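With more than a million rows, the split/lapply step above could be replaced by a vectorised count per block. The following is only a sketch under the same assumption (every block starts with "(Intercept)"); it uses a reduced toy data frame with just the term and SNP columns, and the names block and block_size are introduced here for illustration.

```r
# Reduced toy data: two complete blocks and one block missing GT:SEX
terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
x <- data.frame(
  term = c(terms, terms, terms[-6]),
  SNP  = rep(c("rs1406947", "rs25904", "rs7133579"), times = c(7, 7, 6)),
  stringsAsFactors = FALSE
)

# A new block id starts at every "(Intercept)" row
block <- cumsum(x$term == "(Intercept)")

# ave() returns, for every row, the number of rows in that row's block
block_size <- ave(seq_along(block), block, FUN = length)

# Keep only the rows that belong to blocks of exactly 7 rows
res <- x[block_size == 7, ]
```

Like the answer above, this only checks the block length, not which terms are present.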
Answer 1 (score: 2)
I think you want to group by SNP and check that each block matches what you expect:
library(dplyr)
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
df %>%
  group_by(SNP) %>%
  filter(
    all(expected_terms %in% term)
  )
To be stricter, if you need to make sure that every term occurs exactly once and that no other terms appear:
df %>%
  group_by(SNP) %>%
  filter(
    # use `table` to count occurrence of terms, keep only if all are counted exactly once
    all(table(term)[expected_terms] == 1),
    # keep only if no terms remain after removing your expected set
    length(setdiff(term, expected_terms)) == 0
  )
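If dplyr is not available, the same strict check can be sketched in base R. This is an illustration rather than part of the answer above: the data frame x is a reduced toy version with only the term and SNP columns, and the SNP "rs_hypothetical" is a made-up block that has 7 rows but repeats GT and lacks GT:SEX, to show a case that a length check alone would let through.

```r
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")

# Toy data: two complete blocks, one block missing GT:SEX,
# and one made-up block with 7 rows where GT is duplicated
x <- data.frame(
  term = c(expected_terms, expected_terms,
           expected_terms[-6], c(expected_terms[-6], "GT")),
  SNP  = rep(c("rs1406947", "rs25904", "rs7133579", "rs_hypothetical"),
             times = c(7, 7, 6, 7)),
  stringsAsFactors = FALSE
)

# For each SNP: exactly 7 rows and exactly the expected set of terms,
# which together imply that every expected term occurs once
ok <- tapply(x$term, x$SNP, function(t) {
  length(t) == length(expected_terms) && setequal(t, expected_terms)
})

# SNPs whose block passed the check
keep <- names(ok)[as.vector(ok)]
res <- x[x$SNP %in% keep, ]
```

Here res keeps only the rows for rs1406947 and rs25904; both the short block and the block with the duplicated term are dropped.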