Question

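I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web service load test, about 20M observations. I have the following variables: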

epochtime, uri, cache (hit or miss)

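I think I need to do a few things. I need to subset my data frame to the top 50 distinct URIs, then, for each observation in each subset, calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time, by URI.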

I have read, and am still reading, various posts on this topic, but I'm very new to R and I have a deadline. I'd appreciate any help I can get.

EDIT:

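I can't provide the exact data, but it looks like the sample below, and it is at least 20M observations retrieved from a Mongo database. Time is epoch seconds, and we record thousands of requests per second, so there are a lot of ties on time; that is expected. There may be more than 50 URIs, but I only care about the top 50. The end result would be a line plot, per URI, of % TCP_HIT (relative to the total number of occurrences) over time. I hope that's clearer.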

time                uri                 action
1355683900          /some/uri           TCP_HIT
1355683900          /some/other/uri     TCP_HIT 
1355683905          /some/other/uri     TCP_MISS
1355683906          /some/uri           TCP_MISS

Answer 1

You are looking for the aggregate function.

Call your data frame u:

> u
        time             uri   action
1 1355683900       /some/uri  TCP_HIT
2 1355683900 /some/other/uri  TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906       /some/uri TCP_MISS

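Here is the hit ratio for a subset, in 10-second intervals. It relies on the order of the factor levels (TCP_HIT = 1, TCP_MISS = 2, since levels are sorted alphabetically by default):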

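# Requires u$action to be a factor with levels TCP_HIT (1) and TCP_MISS (2);
# since R 4.0.0, data.frame() and read.csv() no longer create factors by default.
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
         FUN=function(x) sum((2-as.numeric(x))/length(x)))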

Now use lapply to get the final result (u$uri must be a factor as well):

lapply(seq_along(levels(u$uri)),
    function(l) list(uri=levels(u$uri)[l],
     hits=ratio(u[as.numeric(u$uri) == l,])))


[[1]]
[[1]]$uri
[1] "/some/other/uri"

[[1]]$hits
  u$time%/%10 u$action
1   135568390      0.5


[[2]]
[[2]]$uri
[1] "/some/uri"

[[2]]$hits
  u$time%/%10 u$action
1   135568390      0.5

Alternatively, filter the data frame by URI before computing the ratio.
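The filtering step is not spelled out above, so here is a minimal sketch of one way it could look in base R. It assumes that "top 50" means the 50 most frequently requested URIs and that the columns are named time, uri and action as in the sample; n_top, uri_counts, top_uris and u_top are names made up for this illustration.

```r
# Toy data in the shape shown in the question (factors, as the answer assumes)
u <- data.frame(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS"),
  stringsAsFactors = TRUE
)

n_top <- 50  # assumption: "top" means most frequently requested

# Count rows per URI, sort descending, and keep the names of the busiest ones
uri_counts <- sort(table(u$uri), decreasing = TRUE)
top_uris   <- head(names(uri_counts), n_top)

# Keep only rows for those URIs and drop factor levels that are no longer used,
# so that levels(u_top$uri) lists only the retained URIs
u_top <- droplevels(u[u$uri %in% top_uris, ])
```

The droplevels() call matters if the result is then fed to the lapply above, which iterates over levels(u$uri); without it, URIs that were filtered out would still appear as empty levels.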

Answer 2

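@MatthewLundberg's code is the right idea. Specifically, you want something that uses the split-apply-combine strategy.

Given the size of your data, though, I'd take a look at the data.table package.

You can see why visually here: data.table is just faster.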
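As a rough sketch of what the data.table route might look like (not code from this answer), the grouped hit percentage per URI and 10-second bucket can be written as a single grouped expression. The column names follow the sample in the question; bucket_secs, hits, hit_pct and n are names chosen for this illustration.

```r
library(data.table)

# Toy data in the shape shown in the question
dt <- data.table(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS")
)

bucket_secs <- 10  # width of each time bucket, in seconds

# Split by (uri, bucket), apply mean() to the hit indicator, combine into one table
hits <- dt[, .(hit_pct = 100 * mean(action == "TCP_HIT"), n = .N),
           by = .(uri, bucket = time %/% bucket_secs)]
```

Comparing action to the string "TCP_HIT" works whether the column is character or factor, so this version does not depend on the order of factor levels.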

Answer 3

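I thought it would be useful to share my solution to the plotting part of the problem.

My R "noobness" may shine through here, but this is what I came up with. It makes a basic line plot, one per URI. It plots the actual values; I haven't done any conversions.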

# h is assumed to be a list of (uri, hits) pairs, like the lapply result in Answer 1
for (i in seq_along(h)) {
  name <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type="o")
  title(main=name)
}
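Since the plot above uses the raw bucket numbers and ratios, here is a hedged sketch of the conversions that were skipped: turning a 10-second bucket index back into a timestamp and the ratio into a percentage. The data frame below is hypothetical, with made-up values for illustration only, and the column names bucket, ratio, when and pct are assumptions.

```r
# Hypothetical per-URI summary: one row per 10-second bucket (values are made up)
hits <- data.frame(bucket = c(135568390, 135568391, 135568392),
                   ratio  = c(0.5, 0.8, 0.65))

# Bucket index -> epoch seconds -> POSIXct timestamp (UTC assumed)
hits$when <- as.POSIXct(hits$bucket * 10, origin = "1970-01-01", tz = "UTC")

# Ratio in [0, 1] -> percentage
hits$pct <- 100 * hits$ratio

# Same kind of line plot as above, with a readable time axis and a fixed 0-100 scale
plot(hits$when, hits$pct, type = "o", ylim = c(0, 100),
     xlab = "Time (UTC)", ylab = "% TCP_HIT", main = "/some/uri")
```

Fixing ylim to c(0, 100) keeps the plots for different URIs visually comparable.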

Calculating percentages over time on a very large data frame