Aggregating a large data.frame

Date: 2016-04-05 10:22:29

Tags: r

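I have a large data frame (more than 1,000,000 entries) in which one column holds a date/time variable and another holds a numeric value. The problem is that some date/time values occur two or three times, and the corresponding numeric values need to be averaged so that I end up with exactly one numeric value per date/time value.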

So far, I have been doing the following:

## audio_together is the data frame with two columns, $timestamp and $amplitude
## (i.e. the numeric value)

timestamp_unique <- unique(audio_together$timestamp)   ## find all timestamps
## audio_together3 is the new vector holding one value per unique timestamp
audio_together3 <- rep(NA, length(timestamp_unique))
count <- 0
for (k in seq_along(timestamp_unique)) {
  temp_time <- timestamp_unique[k]
  if (k == 1) {
    ## search only a small window of rows that definitely contains the timestamp
    temp_subset <- audio_together[1:10, ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  } else {
    temp_subset <- audio_together[count:(count + 9), ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  }
  if (length(temp_data_which) > 1) {
    audio_together3[k] <- mean(temp_subset$amplitude[temp_data_which], na.rm = TRUE)
  } else {
    audio_together3[k] <- temp_subset$amplitude[temp_data_which]
  }
  ## advance the window by the number of rows consumed for this timestamp
  count <- count + length(temp_data_which)
}

However, this process is still quite slow. Any ideas for speeding it up significantly (i.e. to a run time in the range of minutes)?

Update: example

timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14", "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16", "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)

audio_together <- data.frame(timestamp, amplitude)

2 answers:

Answer 0 (score: 0)

Without a minimal reproducible example this is hard to test, but if you want to average all amplitude values that share the same timestamp, this dplyr solution may help:

library(dplyr)
audio_together %>% 
  group_by(timestamp) %>% 
  summarize(av_amplitude = mean(amplitude, na.rm = TRUE)) %>% 
  ungroup()
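
For readers who cannot install dplyr, the same grouping idea can be sketched in base R with aggregate(). This is only an illustration on the example data from the question, not something the answer itself tested; the result name avg_amplitude is made up here. Note that the formula interface of aggregate() drops rows with missing values by default, which plays roughly the same role as na.rm = TRUE above.

```r
# Example data copied from the question
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

# One row per timestamp, with the mean of all amplitudes sharing that timestamp
# (avg_amplitude is a hypothetical name chosen for this sketch)
avg_amplitude <- aggregate(amplitude ~ timestamp, data = audio_together, FUN = mean)
print(avg_amplitude)
# Means per timestamp, in order: 200, 317, 271.5, 427, 266.5
```

How this compares in speed with dplyr on more than a million rows was not measured here.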

Answer 1 (score: 0)

Thanks for the idea.

The following works perfectly:

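library(dplyr)
audio_together <- audio_together %>% group_by(timestamp)
## name the summarised column explicitly so it is not called "mean(amplitude, na.rm = TRUE)"
audio_together <- ungroup(audio_together %>% summarise(amplitude = mean(amplitude, na.rm = TRUE)))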
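
Because this overwrites audio_together, it may be worth checking the aggregated result before relying on it. The sketch below runs on the example data from the question and uses base R tapply() as a stand-in for the dplyr call so that it runs without extra packages; the names original, means and result are made up for illustration.

```r
# Example data copied from the question
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
original <- data.frame(timestamp, amplitude)

# Stand-in for the dplyr aggregation: mean amplitude per timestamp
means <- tapply(original$amplitude, original$timestamp, mean, na.rm = TRUE)
result <- data.frame(timestamp = names(means), amplitude = as.vector(means))

# Check 1: no timestamp occurs more than once after aggregation
no_duplicates <- anyDuplicated(result$timestamp) == 0
# Check 2: every distinct timestamp of the input is still present
all_present <- nrow(result) == length(unique(original$timestamp))

stopifnot(no_duplicates, all_present)
```

The same two checks can be applied to the dplyr result by replacing result with the aggregated audio_together, provided a copy of the original data frame was kept.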