How to take the mean of a subset of a data frame inside a for loop in R

Asked: 2017-01-26 08:01:33

Tags: r

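I have a large dataset with 203,614 rows and 3 columns named "Price", "Timestamp", and "Energy". The timestamp column contains repeated values, since several transactions can share the same timestamp.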

The columns of dataset:

Price is numeric

Timestamp (Timestamp.Transaction in the data) is POSIXct

Energy is numeric

dput(head(dataset))

structure(list(Price = c(18, 20, 23, 15, 15, 15), Timestamp.Transaction = structure(c(1388500200, 1388500200, 1388502000, 1388502000, 1388502000, 1388502000), class = c("POSIXct", "POSIXt"), tzone = ""), Energy = c(414, 230, 3, 3, 3, 3)), .Names = c("Price", "Timestamp.Transaction", "Energy"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

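I have to perform the following steps in a loop:

1) Subset the dataset by timestamp, keeping the rows that fall within 1.33 days (115200 seconds in the code below) of a given timestamp

2) Compute the minimum, maximum, and mean of the price in that subset and assign them to a new data frame

3) Repeat the steps above every 15 minutes

Note: m1 is my dataset

t1 is the vector of timestamps; since it contains repeated values, I keep only the unique ones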

t1 <- unique(timestamp)

I have tried the following, but it takes a very long time to run and the result is wrong:

    for(i in 125:length(t1)){
      for(j in 1:203614){
        s1[j,] <- subset(m1, (m1$Timestamp.Transaction <= t1[i] & m1$Timestamp.Transaction >= t1[i]-115200))
      }
    }

2 Answers:

Answer 0 (score: 0)

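# Set timestamps to the vector of all "certain timestamps" and max.time.diff to
# "1.33 days" expressed in seconds (115200 in the question's code).
# Subtracting POSIXct values yields a difftime whose units are picked automatically,
# so the difference is converted to seconds explicitly below.
# timestamps <- ...
# max.time.diff <- ...
len <- length(timestamps)
mins <- rep(NA_real_, len)
maxs <- mins
means <- mins
for (i in seq_len(len)) {
    timestamp <- timestamps[i]
    # Seconds between every transaction and the current timestamp
    time.diff <- as.numeric(difftime(m1$Timestamp.Transaction, timestamp, units = "secs"))
    # abs() keeps transactions within max.time.diff on either side of the timestamp
    prices <- m1$Price[abs(time.diff) <= max.time.diff]
    mins[i] <- min(prices)
    maxs[i] <- max(prices)
    means[i] <- mean(prices)
}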
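
As a minimal sketch of how the loop above might be run end to end, the following uses the six sample rows from the question's dput() output. It assumes the one-sided window from the question's own code (transactions at most 115200 seconds before each timestamp) and takes the unique transaction timestamps as `timestamps`. The `result` data frame, the `age` variable, and the UTC time zone are illustrative choices, not part of the original answer.

```r
# Sample rows copied from the question's dput(head(dataset));
# the UTC time zone is an assumption made for reproducibility.
m1 <- data.frame(
  Price = c(18, 20, 23, 15, 15, 15),
  Timestamp.Transaction = as.POSIXct(
    c(1388500200, 1388500200, 1388502000, 1388502000, 1388502000, 1388502000),
    origin = "1970-01-01", tz = "UTC"),
  Energy = c(414, 230, 3, 3, 3, 3)
)

timestamps <- unique(m1$Timestamp.Transaction)  # one window per distinct timestamp
max.time.diff <- 115200                         # window length in seconds

len <- length(timestamps)
mins <- rep(NA_real_, len)
maxs <- mins
means <- mins
for (i in seq_len(len)) {
  # Seconds from each transaction to timestamp i (positive = transaction is earlier)
  age <- as.numeric(difftime(timestamps[i], m1$Timestamp.Transaction, units = "secs"))
  prices <- m1$Price[age >= 0 & age <= max.time.diff]
  mins[i] <- min(prices)
  maxs[i] <- max(prices)
  means[i] <- mean(prices)
}

# Step 2 of the question: collect the statistics in a new data frame
result <- data.frame(Timestamp = timestamps, Min = mins, Max = maxs, Mean = means)
print(result)
```

On the full dataset, `m1` would be the real data rather than the six sample rows, and a window with no transactions would give `Inf`, `-Inf`, and `NaN`, so such windows may need to be skipped.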

Answer 1 (score: 0)

You can put the subsets in a list with lapply():

newdf <- lapply(t1, function(x) 
  subset(dataset, dataset$Timestamp.Transaction <=x & dataset$Timestamp.Transaction >= x-115200))

Then get a list with the summary() of the Price column for every subset:

summaries <- lapply(newdf, function(x) summary(x["Price"]))

Output:

[[1]]
     Price     
 Min.   :18.0  
 1st Qu.:18.5  
 Median :19.0  
 Mean   :19.0  
 3rd Qu.:19.5  
 Max.   :20.0  

[[2]]
     Price      
 Min.   :15.00  
 1st Qu.:15.00  
 Median :16.50  
 Mean   :17.67  
 3rd Qu.:19.50  
 Max.   :23.00

To name the summary entries, simply use

names(summaries) <- sapply(t1, function(x) paste(x-115200, x, sep = " - "))

New output:

$`2013-12-30 07:30:00 - 2013-12-31 15:30:00`
     Price     
 Min.   :18.0  
 1st Qu.:18.5  
 Median :19.0  
 Mean   :19.0  
 3rd Qu.:19.5  
 Max.   :20.0  

$`2013-12-30 08:00:00 - 2013-12-31 16:00:00`
     Price      
 Min.   :15.00  
 1st Qu.:15.00  
 Median :16.50  
 Mean   :17.67  
 3rd Qu.:19.50  
 Max.   :23.00  

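This should be faster than the nested for() loop in the question, since each timestamp is handled with a single vectorized subset.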
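
Step 3 of the question, repeating the calculation every 15 minutes, could be read as moving the end of the window along a regular 15-minute grid instead of over the observed timestamps. The following is a minimal sketch under that assumption, using the question's six sample rows. The names `grid`, `window_stats`, and `result`, and the UTC time zone, are hypothetical and chosen only for illustration.

```r
# Sample rows copied from the question's dput(head(dataset));
# the UTC time zone is an assumption made for reproducibility.
dataset <- data.frame(
  Price = c(18, 20, 23, 15, 15, 15),
  Timestamp.Transaction = as.POSIXct(
    c(1388500200, 1388500200, 1388502000, 1388502000, 1388502000, 1388502000),
    origin = "1970-01-01", tz = "UTC"),
  Energy = c(414, 230, 3, 3, 3, 3)
)

# Window end points every 15 minutes, from the first to the last transaction
grid <- seq(min(dataset$Timestamp.Transaction),
            max(dataset$Timestamp.Transaction),
            by = "15 min")

# Min, max, and mean price of the transactions in [end - width, end]
window_stats <- function(end, width = 115200) {
  in_window <- dataset$Timestamp.Transaction <= end &
    dataset$Timestamp.Transaction >= end - width
  p <- dataset$Price[in_window]
  c(Min = min(p), Max = max(p), Mean = mean(p))
}

# One column per grid point, then transposed so each window becomes a row
stats <- vapply(seq_along(grid),
                function(i) window_stats(grid[i]),
                c(Min = 0, Max = 0, Mean = 0))
result <- data.frame(WindowEnd = grid, t(stats))
print(result)
```

Indexing with `seq_along(grid)` rather than iterating over `grid` directly keeps each end point a POSIXct value regardless of how the apply function converts its input.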