我有一个带有两个变量的数据集df:一个(上升顺序)posixct变量date.time和一个数值变量值。变量值由一系列零或一系列不同于零的正数组成。每个系列的长度是随机的,但大于一个。
编辑:加载lubridate库
library(lubridate)
数据集df:
set.seed(10)
df <- data.frame(date.time=seq(ymd_hms("2016-01-01 00:00:00"),ymd_hms("2016-01-01 01:00:00"),length.out = 20),value=c(runif(3,1,3),rep.int(0,4),runif(5,1,3),rep.int(0,4),runif(4,1,3)))
期望的结果:
desired.outcome <- cbind(df,peak=c(1,1,1,0,0,0,0,2,2,2,2,2,0,0,0,0,3,3,3,3))
我想创建一个名为peak的第三个变量,它将每个大于零的正数序列标识为单独的&#34;峰值&#34;。峰值定义为两个零序列之间的一系列大于零的正数。
答案 0 :(得分:4)
var tempResults = new Dictionary<Record, List<Record>>();
errors = new List<Record>();
foreach (Record record in diag)
{
var code = Convert.ToInt16(Regex.Split(record.Line, @"\s{1,}")[4], 16);
var cond = codes.Where(x => x.Value == code && x.Active).FirstOrDefault();
if (cond == null)
{
errors.Add(record);
continue;
}
var min = record.Datetime.AddSeconds(downDiff);
var max = record.Datetime.AddSeconds(upDiff);
//PROBLEM PART - It takes around 4,5ms
var possibleResults = cas.Where(x => x.Datetime >= min && x.Datetime <= max).ToList();
if (possibleResults.Count == 0)
errors.Add(record);
else
{
if (!CompareCond(record, possibleResults, cond, ref tempResults, false))
{
errors.Add(record);
}
}
}
在一行中执行此操作:
a=rle(df$value>0)
a$values=cumsum(a$values)*a$values
peak=inverse.rle(a)
peak
[1] 1 1 1 0 0 0 0 2 2 2 2 2 0 0 0 0 3 3 3 3
cbind(df,peak)
date.time value peak
1 2016-01-01 00:00:00 2.014956 1
2 2016-01-01 00:03:09 1.613537 1
3 2016-01-01 00:06:18 1.853815 1
4 2016-01-01 00:09:28 0.000000 0
5 2016-01-01 00:12:37 0.000000 0
6 2016-01-01 00:15:47 0.000000 0
7 2016-01-01 00:18:56 0.000000 0
8 2016-01-01 00:22:06 2.386204 2
9 2016-01-01 00:25:15 1.170272 2
10 2016-01-01 00:28:25 1.450873 2
11 2016-01-01 00:31:34 1.549061 2
12 2016-01-01 00:34:44 1.544610 2
13 2016-01-01 00:37:53 0.000000 0
14 2016-01-01 00:41:03 0.000000 0
15 2016-01-01 00:44:12 0.000000 0
16 2016-01-01 00:47:22 0.000000 0
17 2016-01-01 00:50:31 2.231659 3
18 2016-01-01 00:53:41 1.859343 3
19 2016-01-01 00:56:50 2.303311 3
20 2016-01-01 01:00:00 2.135476 3
答案 1 :(得分:1)
也许不是美女:
(无法评估您的数据)
set.seed(10)
value=c(runif(3,1,3),rep.int(0,4),runif(5,1,3),rep.int(0,4),runif(4,1,3))
代码:
library(data.table)
result <- rleidv(value>0)
result[!(value>0)]<-0
result[value>0]<-rleidv(result[value>0])
结果:
#[1] 1 1 1 0 0 0 0 2 2 2 2 2 0 0 0 0 3 3 3 3
答案 2 :(得分:0)
另一个选项(mtd2
)供您考虑:
set.seed(10L)
#generate dataset of 5million rows as OP mentioned
N <- 5e6
df <- data.frame(value=10*runif(N))
#randomly set 25% of values to 0
df[sample(N, 0.25*N), "value"] <- 0
##original dataset of 20 rows
# df <- data.frame(date.time=seq(as.POSIXct("2016-01-01 00:00:00"),as.POSIXct("2016-01-01 01:00:00"),length.out = 20),
# value=c(runif(3,1,3),rep.int(0,4),runif(5,1,3),rep.int(0,4),runif(4,1,3)))
mtd1 <- function() {
inverse.rle(with(a<-rle(df$value>0), modifyList(a, list(values=cumsum(values)*values))))
}
val <- df$value
mtd3 <- function() {
result <- rleidv(val>0)
result[!(val>0)]<-0
result[val>0]<-rleidv(result[val>0])
}
library(data.table)
mtd2 <- function() {
setDT(df)[, peak := (value > 0) * ceiling(rleid(value > 0) / 2)]
}
library(microbenchmark)
microbenchmark(mtd1(), mtd3(), mtd2(), times=5L)
定时:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 357.755701 375.957301 517.6211210 610.545700 611.407001 632.439902 5
mtd3() 312.756201 329.190100 385.4440206 329.810201 352.368101 603.095500 5
mtd2() 181.146901 187.001001 256.8531808 215.238501 221.030000 479.849501 5