R: subsetting data more efficiently

Time: 2015-11-20 18:05:29

Tags: r subset

I have a dataset df:

df=data.frame(rbind(c("A",1,1,"abc"),
                    c("B",0,0,"def"),
                    c("C",0,1,"hep"),
                    c("A",1,1,"hit"),
                    c("B",0,1,"occ"),
                    c("C",1,1,"tem"),
                    c("A",1,1,"twi"),
                    c("B",1,1,"twa"),
                    c("C",1,1,"mit"),
                    c("A",1,1,"mot"),
                    c("C",1,1,"mot"),
                    c("B",1,1,"mjak")))
names(df)=c("id","v1","v2","check")

I want to create a subset of df that keeps only the ids whose values in the "check" column are all contained in the vector "ch.vars".

ch.vars=c("abc","hit","mot","twi","mjak")

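If an id has any value other than those given in "ch.vars", all of its rows are excluded from the dataset. For example, ids B and C have other values in the check column, so they are excluded from the subset.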

This is what I have tried so far:

library(dplyr)  # provides arrange() and filter() used below

df$check.var=ifelse(df$check %in% ch.vars,1,0)
df=arrange(df,id)

st1=filter(df,check.var==0)
st1=as.character(unique(st1$id))

df2=df[!df$id %in% st1,]

> df2
  id v1 v2 check check.var
1  A  1  1   abc         1
2  A  1  1   hit         1
3  A  1  1   twi         1
4  A  1  1   mot         1

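This works, but I would like to know whether there is a more efficient way to do it, i.e. one that reaches the result in fewer steps. Thanks!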

2 answers:

Answer 0: (score: 3)

You can do this with group_by and filter from the dplyr package.
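The code for this answer is not shown here. As a minimal sketch of how a group_by plus filter approach might look (an assumption about the general idea, not necessarily the answerer's exact code), the example below rebuilds only the id and check columns from the question and keeps the groups in which every check value appears in ch.vars:

```r
library(dplyr)

# Reduced copy of the question's data: only the columns the filter needs
df <- data.frame(
  id    = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "C", "B"),
  check = c("abc", "def", "hep", "hit", "occ", "tem",
            "twi", "twa", "mit", "mot", "mot", "mjak"),
  stringsAsFactors = FALSE
)
ch.vars <- c("abc", "hit", "mot", "twi", "mjak")

# all() yields one TRUE/FALSE per group, so each id is kept or dropped as a whole
res <- df %>%
  group_by(id) %>%
  filter(all(check %in% ch.vars)) %>%
  ungroup()

print(res)
```

Because the condition is evaluated once per id, no helper column such as check.var and no intermediate vector of excluded ids is needed.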

Answer 1: (score: 3)

A data.table solution:

library(data.table)
data.table(df)[,.SD[all(check%in%ch.vars)],by="id"]
#   id v1 v2 check
#1:  A  1  1   abc
#2:  A  1  1   hit
#3:  A  1  1   twi
#4:  A  1  1   mot

You can also use setkey on id to speed things up.
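A minimal sketch of the keyed variant, assuming the same data as in the question (reduced here to the id and check columns). setkey sorts the table by id by reference and marks id as the key. Whether this gives a noticeable speed-up depends on the size of the data and was not measured here:

```r
library(data.table)

# Reduced copy of the question's data: only the columns the filter needs
dt <- data.table(
  id    = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "C", "B"),
  check = c("abc", "def", "hep", "hit", "occ", "tem",
            "twi", "twa", "mit", "mot", "mot", "mjak")
)
ch.vars <- c("abc", "hit", "mot", "twi", "mjak")

# Sort by id in place and record id as the table's key
setkey(dt, id)

# Same per-group test as in the answer: keep a group only if all values match
res <- dt[, .SD[all(check %in% ch.vars)], by = id]

print(res)
```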