Question

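I am trying to create a function in R. If X is a vector and Y is a vector that is a subset of X, where both X and Y may contain repeated elements, then X-Y should contain the remaining elements (which may still include repeats). I tried using setdiff(), but I don't think it works with repeated elements. For example: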

d<-c(1,1,1,5,5,5,3,0,10,10)
b<-c(1,1,0)
e<-setdiff(d,b)
e
[1]  5  3 10

But it should be

c(1,5,5,5,3,10,10)

So I wrote a function

my.sample<-function(d,b){
  y<-numeric()
  u<-numeric()
  t<-list()
  x<-numeric()
  rd<-rle(d)
  rb<-rle(b)
  h<-numeric()
  d.data<-data.frame(rd$lengths,rd$values)
  b.data<-data.frame(rb$lengths,rb$values)

  for(i in 1:nrow(b.data)){
    y[i]<-b.data[i,2]
    u[i]<-b.data[i,1]
    h[i]<-(d.data[d.data$rd.values==y[i],1]-u[i])
    d.data[d.data$rd.values==y[i],1]<-h[i]
  }
  x<-d.data[,1]
  for(j in 1:length(x))
  {
    t[[j]]<-rep(d.data[j,2],x[j])        
  }
  return(unlist(t))        
}

So I tried

my.sample(d,b)
[1]  1  5  5  5  3 10 10

So I thought I had the right algorithm, but when I tried it on another, more complex vector, such as

x<-rpois(100,10)
y<-sample(x,25,replace=F)
my.sample(x,y)

Error in rep(d.data[j, 2], x[j]) : invalid 'times' argument
In addition: There were 21 warnings (use warnings() to see them)

Suddenly there is an error and 21 warnings :( Could you please help me out? By the way, I am new to programming, so please bear with me. Thanks.

Answer 1

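Because you allow repeats, you have a recursive problem. The easiest ~~and most appropriate~~ solution is to use a for loop to run through the elements of b and remove them from d one at a time using match, which finds only the first match. This function also first checks that x is a subset of y:

f <- function(x,y){
  if( all( x %in% y ) )
    for( i in x ) y <- y[ -match( i , y ) ]
  return( y )
}
f(b,d)
#[1]  1  5  5  5  3 10 10

Using your second example...

set.seed(42)
x<-rpois(100,10)
y<-sample(x,90,replace=F)
f(y,x)
# [1] 11 12  9 10 10  9 10  4  9  6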
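To see why the loop in f removes exactly one copy per element, it may help to unroll it by hand on the first example. This is only a sketch: remove_first is a hypothetical helper introduced here for illustration and is not part of the answer.

```r
d <- c(1, 1, 1, 5, 5, 5, 3, 0, 10, 10)
b <- c(1, 1, 0)

# match() returns only the position of the first occurrence
first_pos <- match(1, d)

# Hypothetical helper: drop the first occurrence of a single value.
# It assumes the value is present; otherwise match() returns NA.
remove_first <- function(v, value) v[-match(value, v)]

step1 <- remove_first(d, 1)      # one 1 removed
step2 <- remove_first(step1, 1)  # a second 1 removed
step3 <- remove_first(step2, 0)  # the 0 removed
step3
# [1]  1  5  5  5  3 10 10
```

If a value were missing from the vector, match() would return NA and the negative index would no longer behave as intended, which is presumably why f checks all(x %in% y) before looping.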

Answer 2

Another function:

f <- function(d, b) 
  d[-unlist(tapply(b, b, function(y) head(which(d == y[1]), length(y))))]    

# first example:
f(d, b)
# [1]  1  5  5  5  3 10 10

# second example:
set.seed(42)
x <- rpois(100,10)
y <- sample(x,90,replace=F)
f(x,y)
# [1] 11 12  9 10 10  9 10  4  9  6
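As a rough breakdown of what the one-liner does, the sketch below runs its steps separately on the first example. The intermediate names idx and drop_idx are introduced here for illustration only.

```r
d <- c(1, 1, 1, 5, 5, 5, 3, 0, 10, 10)
b <- c(1, 1, 0)

# Group b by its own values. For each distinct value, y holds all copies
# of that value in b, so length(y) is how many copies must be removed.
# which(d == y[1]) lists every position of the value in d, and head()
# keeps only the first length(y) of them.
idx <- tapply(b, b, function(y) head(which(d == y[1]), length(y)))

# Flatten the per-value positions into one index vector: 8 (the 0), then 1 and 2 (two 1s)
drop_idx <- unlist(idx)

# Negative indexing removes exactly those positions and keeps the original order
res <- d[-drop_idx]
res
# [1]  1  5  5  5  3 10 10
```

Unlike the sort-based approaches, this keeps the surviving elements in their original order.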

Answer 3

Edit: the fastest of the answers posted:

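carl2<-function(x,y) {
 # count how often each distinct value occurs in x and in y
 tx<-table(x)
 ty<-table(y)
 xfact<-as.numeric(names(tx))
 yfact<-as.numeric(names(ty))
 # for values present in both, subtract the count in y from the count in x
 gotit<- ave(c(tx,ty),c(xfact,yfact),FUN=function(a) if(length(a)==2) a[1]-a[2] else a[1])
 gotx<-gotit[1:length(tx)]
 # rebuild the vector from the remaining counts (the result comes out sorted)
 fakerle<-data.frame(values=as.numeric(names(gotx)),lengths=gotx)
 inverse.rle(fakerle)
}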

It is maybe 25% faster than the best one in the benchmark below. OK, I'll stop this nonsense now.

Here's another approach:

Rgames> ds<-sort(d)
Rgames> db<-sort(b)
Rgames> ds[(length(db)+1):length(ds)]
[1]  1  3  5  5  5 10 10

For obvious reasons, that doesn't work at all. My favorite tool, rle, to the rescue:

Rgames> set.seed(1)
Rgames> x<-rpois(100,10)
Rgames> y<-sample(x,25,replace=F)
Rgames> rx<-rle(sort(x))
Rgames> ry<-rle(sort(y))
Rgames>  for(j in ry$values ) rx$lengths[which(rx$values==j)] <- rx$lengths[which(rx$values==j)] - ry$lengths[ry$values==j]
Rgames> 
Rgames>  newx<-inverse.rle(rx[rx$values>0])
Rgames> newx
 [1]  3  4  5  5  5  5  5  6  6  7  7  7  7  7  7  7  7  7  7  8  8  8
[23]  8  8  8  8  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 11 11
[45] 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 13 13
[67] 13 14 14 14 14 14 15 15 16

Hopefully the OP doesn't care about the order of the output elements!
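The console session above can be wrapped into a function along the following lines. This is a sketch rather than the benchmarked code: the name rle_diff is made up here, and it assumes every element of y occurs in x at least as often as in y. Sorting before rle() matters because rle() encodes consecutive runs, so an unsorted vector can list the same value several times.

```r
# On unsorted input the same value can appear in more than one run
unsorted_runs <- rle(c(1, 5, 1))$values   # 1 5 1

rle_diff <- function(x, y) {
  rx <- rle(sort(x))
  ry <- rle(sort(y))
  # Position of each distinct value of y among the distinct values of x
  idx <- match(ry$values, rx$values)
  # Assumption: y is a sub-multiset of x
  stopifnot(!anyNA(idx), all(rx$lengths[idx] >= ry$lengths))
  # Subtract the counts, then expand the runs back into a vector
  rx$lengths[idx] <- rx$lengths[idx] - ry$lengths
  inverse.rle(rx)
}

d <- c(1, 1, 1, 5, 5, 5, 3, 0, 10, 10)
b <- c(1, 1, 0)
rle_diff(d, b)
# [1]  1  3  5  5  5 10 10
```

As with the session above, the result comes back sorted rather than in the original order of x.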

Edit, just to round out the thread: now with the arguments in the correct order, sadly Simon no longer wins. Oh well.

Rgames>  microbenchmark(sven(x,y),simon(y,x),carl(x,y))
Unit: milliseconds
        expr        min         lq     median         uq        max
  sven(x, y)   1.724172   1.803495   1.858658   1.975400   2.073966
 simon(y, x) 104.202881 105.159258 105.928977 106.315333 190.408444
  carl(x, y)   1.705784   1.806489   1.845403   1.927078  22.150382

Answer 4

Try this:

d<-c(1,1,1,5,5,5,3,0,10,10)
b<-c(1,1,0)
d[!(d %in% b)]
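For comparison, here is what that expression returns on the vectors from the question. Since %in% only tests membership, every copy of a value that appears in b is dropped, so no 1 survives, whereas the question expects one 1 to remain. The name res is used here only for illustration.

```r
d <- c(1, 1, 1, 5, 5, 5, 3, 0, 10, 10)
b <- c(1, 1, 0)

# Keep only the elements of d whose value does not occur anywhere in b
res <- d[!(d %in% b)]
res
# [1]  5  5  5  3 10 10
```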

R programming: difference between vectors
