My question is very similar to this one: How to extract the first n rows per group?
#include <string>
#include <unordered_map>
#include <vector>

template<class T, class U> using umap = std::unordered_map<T, U>;

// Recursively expands the weights of nodeName until only leaf nodes remain.
umap<std::string, double> getWeights(const std::string& nodeName, const umap<std::string, umap<std::string, double>>& weightTrees)
{
    const auto it = weightTrees.find(nodeName);
    if (it == weightTrees.end())
        return umap<std::string, double>();

    umap<std::string, double> topWeights = it->second;

    // Copy the keys first, because topWeights is modified inside the loop.
    std::vector<std::string> topNodeNames;
    for (const auto& kv : topWeights)
        topNodeNames.push_back(kv.first);

    for (const std::string& topNodeName : topNodeNames)
    {
        umap<std::string, double> subWeights = getWeights(topNodeName, weightTrees);
        if (subWeights.size() > 0)
        {
            // Replace the intermediate node by its children, scaled by its own weight.
            const double topWeight = topWeights[topNodeName];
            topWeights.erase(topNodeName);
            for (const auto& subWeight : subWeights)
            {
                const auto subIt = topWeights.find(subWeight.first);
                if (subIt == topWeights.end())
                    topWeights[subWeight.first] = topWeight * subWeight.second;
                else
                    subIt->second += topWeight * subWeight.second;
            }
        }
    }
    return topWeights;
}

int main()
{
    umap<std::string, umap<std::string, double>> weightTrees = {{ "Node0", {{ "Node1",0.5 },{ "Node2",0.3 },{ "Node3",0.2 }} },
                                                                { "Node1", {{ "Node2",0.1 },{ "Node4",0.9 }} }};
    umap<std::string, double> w = getWeights("Node0", weightTrees); // gives {Node2: 0.35, Node3: 0.20, Node4: 0.45}
}
We have a data.table dt:
         date age     name       val
1: 2000-01-01   3   Andrew  93.73546
2: 2000-01-01   4      Ben 101.83643
3: 2000-01-01   5  Charlie  91.64371
4: 2000-01-02   6     Adam 115.95281
5: 2000-01-02   7      Bob 103.29508
6: 2000-01-02   8 Campbell  91.79532
to which I added an extra column named val. First, we want to extract the first n rows in each group.
The solutions from the linked question are:
dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
My question is how to apply a function to the first n rows in each group when that function depends on the subsetted information. I am trying to apply something like this:
# uses other columns for results/ is dependent on subsetted rows
# but keep it simple for replication
do_something <- function(dt){
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]
         date age   name       val funcVal
1: 2000-01-01   3 Andrew  93.73546       0
2: 2000-01-01   4    Ben 101.83643       1
3: 2000-01-02   6   Adam 115.95281       0
4: 2000-01-02   7    Bob 103.29508       1
Am I going about this the right way? Is there a more efficient way to do it? I can't seem to figure out how to apply the "speed" solution here. Is there a way to apply the function to the first two rows by date directly, without first saving the result of the subset operation?
Any help would be appreciated. Below is the code that produces the data above:
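For reproduction, here is a minimal sketch that rebuilds the table from its printed values. The values are hard-coded rather than regenerated, so nothing is assumed about how val was originally drawn; the name first_two is only illustrative.

```r
library(data.table)

# Values copied from the printed table; how they were generated is not assumed.
dt <- data.table(
  date = as.Date(rep(c("2000-01-01", "2000-01-02"), each = 3)),
  age  = 3:8,
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = c(93.73546, 101.83643, 91.64371, 115.95281, 103.29508, 91.79532)
)

# First two rows per date, using the .SD form from the linked question.
first_two <- dt[, .SD[1:2], by = date]
```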
Answer 0 (score: 5)
If there is more than one grouping column, it may be more efficient to collapse them into one:
m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]
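This just extends @eddi's approach (keeping the row number .I, as seen in @akrun's answer) to also keep the group counter .GRP.
Regarding the OP's comment that they are more concerned with the function: well, borrowing from @akrun...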
ff = function(x) as.integer(cumsum(x[[1]]) > 200)
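Assuming all values are non-negative, this could be handled more efficiently in C, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, though, it hardly matters.
My impression is that this is a dummy function, so there is little point in optimizing it. Many of the efficiency improvements I would normally think of depend on the function and the data.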
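The early-exit idea can be sketched in plain R. The answer suggests C; this loop only illustrates the logic and is not expected to beat cumsum on short vectors. The function name ff_early and its threshold argument are made up for illustration, and the shortcut is valid only if every value is non-negative.

```r
# Hypothetical helper: flags positions at which the running total exceeds
# `threshold`, stopping the accumulation at the first crossing.
# Assumes all elements of x are non-negative, so the total never drops back.
ff_early <- function(x, threshold = 200) {
  out <- integer(length(x))
  total <- 0
  for (k in seq_along(x)) {
    total <- total + x[k]
    if (total > threshold) {
      out[k:length(x)] <- 1L  # every later position is above the threshold too
      break
    }
  }
  out
}
```

For non-negative input this should agree with as.integer(cumsum(x) > threshold).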
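Answer 1 (score: 3)
We can use as.integer on the cumsum to coerce the logical to binary. Extract the row index, specify it as i, group by 'date', and then apply the function on the 'val' column: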
f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]
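One caveat worth sketching: .I[1:2] pads the index with NA when a group has fewer than two rows, whereas head(.I, 2) returns only the rows that exist. The toy table and its column names below are made up for illustration and are not part of the original question.

```r
library(data.table)

# Hypothetical data: group "b" has a single row.
toy <- data.table(g = c("a", "a", "a", "b"), val = c(120, 90, 50, 250))

f1 <- function(x) as.integer(cumsum(x) > 200)

idx_padded <- toy[, .I[1:2], by = g]$V1      # group "b" contributes an NA
idx_safe   <- toy[, head(.I, 2), by = g]$V1  # only rows that exist

# Apply the function to at most the first two rows of each group, by reference.
toy[idx_safe, newcol := f1(val), by = g]
```

Rows outside the selected index keep NA in newcol.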