How to extract the first n rows per group and calculate a function using that subset?

Time: 2018-10-22 15:17:28

Tags: r data.table

My question is very similar to this one: How to extract the first n rows per group?


We have a data.table dt:

         date age     name       val
1: 2000-01-01   3   Andrew  93.73546
2: 2000-01-01   4      Ben 101.83643
3: 2000-01-01   5  Charlie  91.64371
4: 2000-01-02   6     Adam 115.95281
5: 2000-01-02   7      Bob 103.29508
6: 2000-01-02   8 Campbell  91.79532

to which I have added an extra column named val. First, we want to extract the first n rows within each group. The solutions from the linked question are:

dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed

My question is how to apply a function to the first n rows within each group when that function depends on the subsetted rows. I am trying to apply something like this:

# uses other columns for results / depends on the subsetted rows,
# but kept simple for replication
do_something <- function(dt){
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]

         date age   name       val funcVal
1: 2000-01-01   3 Andrew  93.73546       0
2: 2000-01-01   4    Ben 101.83643       0
3: 2000-01-02   6   Adam 115.95281       0
4: 2000-01-02   7    Bob 103.29508       1

Am I going about this the wrong way? Is there a more efficient way to do this? I can't seem to figure out how to apply the "for speed" solution here. Is there a way to apply the function to the first two rows by date directly, without first saving the result of the subsetting step?

Any help would be appreciated.
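For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds a table with the same contents as the one printed in the question. The values are copied from that printout; the column types (Date, integer, character, numeric) are assumptions, since the printout alone does not reveal them.

```r
library(data.table)

# Values copied from the table printed in the question.
# Column types are assumed, not taken from the original post.
dt <- data.table(
  date = as.Date(rep(c("2000-01-01", "2000-01-02"), each = 3)),
  age  = 3:8,
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = c(93.73546, 101.83643, 91.64371, 115.95281, 103.29508, 91.79532)
)

# First two rows per date, using the row-number (.I) idiom from the linked question
first_two <- dt[dt[, .I[1:2], by = date]$V1]
print(first_two)
```

The `first_two` name is only illustrative; it holds the Andrew, Ben, Adam and Bob rows.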

2 Answers:

Answer 0 (score: 5)

If there is more than one grouping column, it may be more efficient to collapse them into one:

m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]

This is just an extension of @eddi's approach (keeping the row numbers .I, as seen in @akrun's answer) that also keeps the group counter .GRP.
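The snippet above groups by a single column, so the benefit of .GRP is easier to see with two grouping columns. The following is a sketch on made-up data: the `dt2` table, its `site` column, and all values are hypothetical, and `ff` is the same cumulative-sum function used in this answer.

```r
library(data.table)

# Hypothetical data: two grouping columns, three rows per group
dt2 <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 6),
  site = rep(rep(c("a", "b"), each = 3), times = 2),
  val  = c(120, 90, 5, 50, 60, 500, 150, 70, 1, 210, 10, 1)
)

ff <- function(x) as.integer(cumsum(x[[1]]) > 200)

# One pass over both grouping columns: keep the group counter (.GRP) and
# the row numbers (.I) of the first two rows in each group
m <- dt2[, .(g = .GRP, r = .I[1:2]), by = .(date, site)]

# Update only those rows, grouping by the single integer key m$g
# instead of by (date, site) again
dt2[m$r, v := ff(.SD), by = m$g, .SDcols = "val"]
print(dt2)
```

Rows outside the first two of each group are never touched, so their `v` stays NA.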


As for the OP's comment that they care more about the function itself, here is one, borrowed from @akrun:

ff = function(x) as.integer(cumsum(x[[1]]) > 200)

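Assuming all values are non-negative, this could be handled more efficiently in C, because the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, though, this hardly matters.

My impression is that this is a dummy function, so there is little point in optimizing it. Many of the efficiency improvements I would normally think of depend on the function and the data.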
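To make the early-exit idea concrete, here is a sketch of the logic written in plain R rather than C. The function name `exceeds_after` and its `threshold` argument are hypothetical, and an R-level loop like this is unlikely to beat the vectorized `cumsum` on short vectors; it only illustrates why non-negative values allow stopping early.

```r
# For non-negative x, returns the same 0/1 vector as
# as.integer(cumsum(x) > threshold), but stops adding once the
# threshold has been crossed.
exceeds_after <- function(x, threshold = 200) {
  out <- integer(length(x))      # starts as all zeros
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    if (total > threshold) {
      # With non-negative values, every later cumulative sum is at least
      # as large, so the remaining flags are all 1
      out[i:length(x)] <- 1L
      break
    }
  }
  out
}

x <- c(93.7, 101.8, 91.6, 10)
print(exceeds_after(x))
```

The shortcut is only valid when no value is negative; a negative value could bring the running total back under the threshold.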

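Answer 1 (score: 3)

We can wrap the cumsum comparison in as.integer to coerce the logical result to binary. Extract the row indices, pass them as i, group by 'date', and apply the function to the 'val' column.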

f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]
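One detail worth checking when adapting this: `.I[1:2]` yields an NA row index for any group with fewer than two rows. The sketch below is a variation, not the answerer's code; it swaps in `head(.I, 2)`, which returns at most two indices. The `val` values are taken from the question's table, while the extra one-row group for 2000-01-03 is hypothetical and added only to show the difference.

```r
library(data.table)

dt <- data.table(
  date = c(rep("2000-01-01", 3), rep("2000-01-02", 3), "2000-01-03"),
  val  = c(93.73546, 101.83643, 91.64371, 115.95281, 103.29508, 91.79532,
           250)  # last row is a hypothetical single-row group
)

f1 <- function(x) as.integer(cumsum(x) > 200)

# head(.I, 2) keeps up to two row numbers per group, so the one-row group
# contributes a single index rather than c(7, NA)
i1 <- dt[, head(.I, 2), by = date]$V1

# Apply the function by date on the selected rows only
dt[i1, newcol := f1(val), by = date]
print(dt)
```

Rows that are not among the first two of their group keep NA in `newcol`.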