R data.table lookup with ordered columns

Time: 2015-11-10 01:55:56

Tags: r data.table

I have an R data.table with an id column and several columns that specify ordered threshold levels and their corresponding values. For each row, I want to look up the first level that is greater than or equal to a parameter for that id and return the corresponding value.

Here is a sample data set.

library(data.table)

DT<-data.table(id=c("Obs1","Obs2"),
    level.1=c(1,1),level.2=c(2,4),level.3=c(3,8),
    val.1=c(10,10),val.2=c(20,30),val.3=c(30,50))

DT
     id level.1 level.2 level.3 val.1 val.2 val.3
1: Obs1       1       2       3    10    20    30
2: Obs2       1       4       8    10    30    50

So if the lookup parameters are:

params<-list("Obs1"=2.5,"Obs2"=1) 

the returned values should be:

c(30,10)

I would also like the number of levels and values to be somewhat arbitrary, although they will follow a naming convention like the one in the example.

I can solve this in several steps, but the result is very ugly and probably not computationally efficient.

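I previously solved this more cleanly with data.frames using plyr::ddply, taking advantage of the fact that I could use variable names within the data.frame. (For brevity, I am not including that solution here.)

Any suggestions for improvement are welcome.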

2 answers:

Answer 0 (score: 5)

I would do this with a rolling join, as follows:

DT_m = melt(DT, measure=patterns("^level", "^val"), value.name=c("level", "val"))
query = list(id=c("Obs1", "Obs2"), level=c(2.5, 1))
DT_m[query, val, on=c("id", "level"), roll=-Inf]

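roll=-Inf performs a NOCB join (next observation carried backward). When the value being joined (here, from query) falls in a gap, the next observation is rolled backward and used as the matching row. For example, for Obs1 the value 2.5 falls between levels 2 and 3, so the matching row is the one with level 3 (the next observation), and the corresponding val is 30.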
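The three cases a NOCB join can hit (a value in a gap, an exact match, and a value above every level) can be checked with a small sketch. It reuses the question's sample data; the third query row (Obs1 with 5) is a hypothetical addition for illustration, and the sketch relies on data.table's default rollends behaviour for a negative roll.

```r
library(data.table)

# Sample data from the question: ordered thresholds (level.*) and values (val.*).
DT <- data.table(id = c("Obs1", "Obs2"),
                 level.1 = c(1, 1), level.2 = c(2, 4), level.3 = c(3, 8),
                 val.1 = c(10, 10), val.2 = c(20, 30), val.3 = c(30, 50))

# Long form: one row per (id, level, val) triple.
DT_m <- melt(DT, measure.vars = patterns("^level", "^val"),
             value.name = c("level", "val"))

# 2.5 falls in a gap, 1 is an exact match, 5 lies above every level of Obs1
# (the last row is a hypothetical extra case, not part of the original query).
query <- data.table(id = c("Obs1", "Obs2", "Obs1"), level = c(2.5, 1, 5))

# roll = -Inf carries the next observation backward (NOCB).
res <- DT_m[query, on = c("id", "level"), roll = -Inf]
res[, .(id, level, val)]
```

In the result, the level column holds the queried values, and val should be 30, 10 and NA: a value beyond the last level has no next observation to roll back from.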

Answer 1 (score: 2)

Here is one way:

mDT = melt(DT, measure.var = patterns("level","val"), value.name = c("level","val"))
setkey(mDT, id)

#      id variable level val
# 1: Obs1        1     1  10
# 2: Obs1        2     2  20
# 3: Obs1        3     3  30
# 4: Obs2        1     1  10
# 5: Obs2        2     4  30
# 6: Obs2        3     8  50

params2 <- list(id = c("Obs1","Obs2"), v=c(2.5,1)) 
mDT[params2,{
  i = findInterval(v, level, rightmost.closed=TRUE)
  val[ i + (v != level[i]) ]
}, by=.EACHI]

#      id V1
# 1: Obs1 30
# 2: Obs2 10
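The index arithmetic in that join can be easier to follow on plain vectors. The helper below, lookup_val, is a hypothetical name introduced only for illustration. It is a sketch of the same idea using findInterval with its default arguments rather than rightmost.closed=TRUE, plus an explicit guard for values below the first level, which the answer's one-liner does not handle.

```r
# Return the val paired with the first level that is >= v.
# `level` must be sorted ascending; `val` is aligned with it.
lookup_val <- function(v, level, val) {
  i <- findInterval(v, level)          # number of levels that are <= v
  exact <- i > 0 && level[i] == v      # v sits exactly on a threshold
  idx <- if (exact) i else i + 1       # otherwise step to the next level up
  val[idx]                             # NA when idx runs past the last level
}

lookup_val(2.5, c(1, 2, 3), c(10, 20, 30))   # in a gap: takes level 3
lookup_val(1,   c(1, 4, 8), c(10, 30, 50))   # exact match on level 1
lookup_val(5,   c(1, 2, 3), c(10, 20, 30))   # above every level
```

The first two calls reproduce the 30 and 10 from the output above, and the third gives NA because there is no level at or above 5.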

If a value of v in the parameters is above the highest level for its id, NA is returned:

params3 <- list(id = c("Obs1","Obs2"), v=c(5, 1)) 
mDT[params3, {i = findInterval(v, level, rightmost.closed=TRUE); val[ i + (v != level[i])]}, by=.EACHI]

#      id V1
# 1: Obs1 NA
# 2: Obs2 10

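Comment. I think it is better to keep the data in long/melted form than to play games with column names.

If you want to enter the parameters as key-value pairs, stack and setNames can help: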

p0      = list(Obs1 = 1, Obs2 = 2.5)
params0 = setNames(stack(p0), c("v","id"))
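To see what the join would receive from that construction, the following sketch inspects the result. The conversion of id to character is an optional precaution assumed here, not part of the original answer: stack returns its ind column as a factor, while id in the melted table is character.

```r
# Key-value parameters, as in the answer above.
p0 <- list(Obs1 = 1, Obs2 = 2.5)

# stack() returns a data.frame with columns `values` and `ind`;
# setNames() renames them to the names used in the join.
params0 <- setNames(stack(p0), c("v", "id"))

# `ind` is a factor; convert it so the key type matches a character id column
# (an assumption for illustration, not part of the original answer).
params0$id <- as.character(params0$id)

str(params0)
```

The resulting params0 has one row per key, with the numeric parameter in v and the key in id, so it can take the place of params2 in the join shown earlier.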