Compare values between two data frames to produce a joined output file

Time: 2016-03-07 20:25:10

Tags: r

I have two data frames, dfA and dfB.

dfA:

ID <- c('ID1','ID2','ID3','ID4')
lowval <- c(12,13,20,40)
upval <- c(14,15,22,42)
cr <- c("item1","item2","item3","item4")
dfA <- data.frame(ID,lowval,upval,cr)

>dfA
   ID lowval upval    cr
1 ID1     12    14 item1
2 ID2     13    15 item2
3 ID3     20    22 item3
4 ID4     40    42 item4

dfB:

match <- c('30','30','30','30')
pos <- c(3,13,18,41)
desc <- c("heavy","light","blue","black")
dfB <- data.frame(match,pos,desc)

>dfB
  match pos  desc
1    30   3 heavy
2    30  13 light
3    30  18  blue
4    30  41 black

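I want to go through every row and check whether dfB$pos lies between dfA$lowval and dfA$upval; if it does, the full rows from dfA and dfB should be written to an output file.

In this case, the desired output file looks like this: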

   ID lowval upval    cr match pos  desc
  ID1     12    14 item1    30  13 light
  ID4     40    42 item4    30  41 black

I tried writing a function:

f <- function(x, y, output) {
lowervalue = x[2]
uppervalue = x[3]
position = y[2]
if(position>=lowervalue & position<=uppervalue){
print(paste(x,y,sep="\t"))
cat(paste(x,y, sep="\t"), file= output, append = T, fill = T)
}
}
apply(dfA, dfB, f, output = 'outputfile.txt')

But I get the following error:

Error in ds[-MARGIN] : invalid subscript type 'list'
In addition: Warning messages:
1: In Ops.factor(left) : ‘-’ not meaningful for factors
2: In Ops.factor(left) : ‘-’ not meaningful for factors

Can anyone suggest a way to create this output file? I am stuck.

2 Answers:

Answer 0: (score: 1)

You can try:
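
For reference, the error comes from the call itself: `apply()` has the signature `apply(X, MARGIN, FUN, ...)`, so `dfB` is being passed as `MARGIN` rather than as a second data set. The following is a minimal sketch of the same row-by-row idea using two nested loops. The helper name `find_matches` is hypothetical, and the sketch assumes inclusive bounds and a tab-separated output file; with inclusive bounds the ID2 row also matches position 13.

```r
dfA <- data.frame(ID = c("ID1", "ID2", "ID3", "ID4"),
                  lowval = c(12, 13, 20, 40),
                  upval = c(14, 15, 22, 42),
                  cr = c("item1", "item2", "item3", "item4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(match = c("30", "30", "30", "30"),
                  pos = c(3, 13, 18, 41),
                  desc = c("heavy", "light", "blue", "black"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: compare every row of a with every row of b and
# collect the pairs where b$pos falls within [a$lowval, a$upval].
find_matches <- function(a, b) {
  hits <- list()
  for (i in seq_len(nrow(a))) {
    for (j in seq_len(nrow(b))) {
      if (b$pos[j] >= a$lowval[i] && b$pos[j] <= a$upval[i]) {
        hits[[length(hits) + 1]] <- cbind(a[i, ], b[j, ])
      }
    }
  }
  do.call(rbind, hits)
}

res <- find_matches(dfA, dfB)

# Write the matched pairs as a tab-separated file.
write.table(res, "outputfile.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```

Growing a list inside nested loops is slow on large inputs, so this is meant to show the logic rather than to be the fastest approach.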

merge(dfA, dfB)[c(sapply(dfB$pos, function(x) apply(dfA[2:3], 1, function(y) 
y[1] <= x & y[2] >= x))),]

    ID lowval upval    cr match pos  desc
5  ID1     12    14 item1    30  13 light
6  ID2     13    15 item2    30  13 light
16 ID4     40    42 item4    30  41 black
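
To see why this works, it may help to build the pieces separately. With no shared column names, `merge()` returns the cross join of the two data frames (16 rows here), with the rows of `dfA` varying fastest. The `sapply()` call builds a logical matrix in the same order, which `c()` flattens into a row mask. A sketch, where `all_pairs`, `keep`, and `result` are names introduced only for illustration:

```r
dfA <- data.frame(ID = c("ID1", "ID2", "ID3", "ID4"),
                  lowval = c(12, 13, 20, 40),
                  upval = c(14, 15, 22, 42),
                  cr = c("item1", "item2", "item3", "item4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(match = c("30", "30", "30", "30"),
                  pos = c(3, 13, 18, 41),
                  desc = c("heavy", "light", "blue", "black"),
                  stringsAsFactors = FALSE)

# No common columns, so merge() pairs every dfA row with every dfB row.
all_pairs <- merge(dfA, dfB)

# Logical matrix: one row per dfA interval, one column per dfB$pos value.
keep <- sapply(dfB$pos, function(x) dfA$lowval <= x & dfA$upval >= x)

# c() flattens column by column, which lines up with the rows of all_pairs.
result <- all_pairs[c(keep), ]
print(result)
```

Note that the cross join holds nrow(dfA) * nrow(dfB) rows in memory, which can become expensive for large inputs.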

Answer 1: (score: 1)

Solution 1: outer()

f <- 'output.txt';
write(capture.output(print(with(as.data.frame(which(outer(dfB$pos,dfA$lowval,`>=`) & outer(dfB$pos,dfA$upval,`<=`),arr.ind=T)),cbind(dfA[col,],dfB[row,])),row.names=F)),f);
cat(readLines(f),sep='\n');
##   ID lowval upval    cr match pos  desc
##  ID1     12    14 item1    30  13 light
##  ID2     13    15 item2    30  13 light
##  ID4     40    42 item4    30  41 black

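In your question, the expected output does not include ID2, but under an inclusive comparison (i.e. >= and <=) 13 lies between 13 and 15, so it qualifies as a match.

Solution 2: lapply()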
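
If the bounds are meant to be exclusive, swapping the operators for `>` and `<` drops the ID2 row, since 13 is not strictly greater than 13, and leaves only the ID1 and ID4 rows listed in the question. A sketch of that variant of the `outer()` approach, with `inside`, `idx`, and `strict` as illustrative names:

```r
dfA <- data.frame(ID = c("ID1", "ID2", "ID3", "ID4"),
                  lowval = c(12, 13, 20, 40),
                  upval = c(14, 15, 22, 42),
                  cr = c("item1", "item2", "item3", "item4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(match = c("30", "30", "30", "30"),
                  pos = c(3, 13, 18, 41),
                  desc = c("heavy", "light", "blue", "black"),
                  stringsAsFactors = FALSE)

# Logical matrix: rows index dfB, columns index dfA.
# Strict operators exclude positions that sit exactly on a bound.
inside <- outer(dfB$pos, dfA$lowval, `>`) & outer(dfB$pos, dfA$upval, `<`)

# Matrix of (row, col) index pairs for the TRUE cells.
idx <- which(inside, arr.ind = TRUE)

# Pair each matching dfA row with its dfB row.
strict <- cbind(dfA[idx[, "col"], ], dfB[idx[, "row"], ])
print(strict, row.names = FALSE)
```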

f <- 'output.txt';
write(capture.output(print(do.call(rbind,lapply(seq_len(nrow(dfA)),function(ai) { res <- dfB$pos>=dfA$lowval[ai] & dfB$pos<=dfA$upval[ai]; if (any(res)) cbind(dfA[ai,],dfB[res,]); })),row.names=F)),f);
cat(readLines(f),sep='\n');
##   ID lowval upval    cr match pos  desc
##  ID1     12    14 item1    30  13 light
##  ID2     13    15 item2    30  13 light
##  ID4     40    42 item4    30  41 black
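
The one-liner above writes the data frame exactly as it prints on the console. As a sketch of an alternative, the same `lapply()` logic can be spread over several lines and written with `write.table()`, which produces a delimited file instead. The names `pieces`, `matched`, and `out`, and the choice of a tab separator, are assumptions made for illustration:

```r
dfA <- data.frame(ID = c("ID1", "ID2", "ID3", "ID4"),
                  lowval = c(12, 13, 20, 40),
                  upval = c(14, 15, 22, 42),
                  cr = c("item1", "item2", "item3", "item4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(match = c("30", "30", "30", "30"),
                  pos = c(3, 13, 18, 41),
                  desc = c("heavy", "light", "blue", "black"),
                  stringsAsFactors = FALSE)

# For each dfA row, find the dfB rows whose pos falls inside its interval.
# Rows of dfA with no match yield NULL.
pieces <- lapply(seq_len(nrow(dfA)), function(ai) {
  res <- dfB$pos >= dfA$lowval[ai] & dfB$pos <= dfA$upval[ai]
  if (any(res)) cbind(dfA[ai, ], dfB[res, ])
})

# rbind() skips the NULL entries when stacking the pieces.
matched <- do.call(rbind, pieces)

out <- "output.txt"
write.table(matched, out, sep = "\t", quote = FALSE, row.names = FALSE)
cat(readLines(out), sep = "\n")
```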