Rebuilding a data.frame with the gaps filled in

Asked: 2014-11-22 01:49:24

Tags: r

I'm new to R and am working on a project that I need help with.

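I have a CSV file containing one year of data. However, the time series has gaps, and I need it evenly spaced at half-hour intervals (48 rows per day for 365 days gives 17,520 rows for the full year). The gaps range from a single half hour to several days. No rows exist for the missing timestamps. So, with the help of some other forum posts, I wrote a script that imports the CSV into R, creates rows so that the timestamp column has the correct length, and then matches the data to the new timestamp column.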
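As a quick check of the row count described above, here is a minimal sketch that builds the regular half-hourly grid for one full year. The year (2013, not a leap year) and the UTC time zone are assumptions chosen for illustration; a local time zone with daylight saving time could change the count.

```r
# Hypothetical example year (2013, not a leap year).
# UTC is assumed so that daylight-saving shifts do not add or drop timestamps.
start <- as.POSIXct("2013-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2013-12-31 23:30:00", tz = "UTC")

# One timestamp every 30 minutes, both endpoints included
full_grid <- seq(start, end, by = "30 min")

length(full_grid)  # 48 * 365 = 17520
```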

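However, I have about 3 columns of data to match to the new timestamps, and the way I am doing it now is very inefficient. As of now, the data.frame (newdata4) has the correct timestamps. I then add a new column to that frame using the original data from the missing4 data.frame:
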
newdata4 <- as.data.frame(timestamp_corr)
newdata4$PAR_in_Avg <- missing4$PAR_in_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_in_Avg[is.na(newdata4$PAR_in_Avg)] <- -9999 # replace NAs with -9999

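In this example, PAR_in_Avg is one of the columns in the original CSV file. This works well. However, to get all of the columns into newdata4, I repeat these lines over and over: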

newdata4$PAR_in_Avg <- missing4$PAR_in_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_in_Avg[is.na(newdata4$PAR_in_Avg)] <- -9999 # replace NAs with -9999
newdata4$PAR_out_Avg <- missing4$PAR_out_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_out_Avg[is.na(newdata4$PAR_out_Avg)] <- -9999 # replace NAs with -9999
newdata4$Rn_meas_Avg <- missing4$Rn_meas_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$Rn_meas_Avg[is.na(newdata4$Rn_meas_Avg)] <- -9999 # replace NAs with -9999
newdata4$PYRA_CMP3_Avg <- missing4$PYRA_CMP3_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PYRA_CMP3_Avg[is.na(newdata4$PYRA_CMP3_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_1_Avg <- missing4$G_1_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_1_Avg[is.na(newdata4$G_1_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_2_Avg <- missing4$G_2_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_2_Avg[is.na(newdata4$G_2_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_3_Avg <- missing4$G_3_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_3_Avg[is.na(newdata4$G_3_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_4_Avg <- missing4$G_4_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_4_Avg[is.na(newdata4$G_4_Avg)] <- -9999 # replace NAs with -9999

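This is not sustainable, because I have to do the same for other sites and other years (each with different column headers). Ideally, I would like R to read the first row of the CSV file to determine how many columns there are, and then, after the new time series is built, use pmatch to add each column back.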

I was able to merge the newdata4 data.frame with the original missing4 data.frame, but doing so dropped all of the rows that had just been created for the gaps.

Is there a simple way to put the data back together without all the repetition?

1 Answer:

Answer 0: (score: 0)

Try

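# Regular half-hourly grid spanning the first to the last observed timestamp
newdat <- data.frame(timestamp=with(dat, seq(min(timestamp),
                     max(timestamp), by='30 min')))

# all=TRUE keeps grid rows that have no match in dat (their values become NA)
dat1 <- merge(dat, newdat, by='timestamp', all=TRUE)
# Every column except the timestamp gets the -9999 missing-value code
indx <- setdiff(colnames(dat1), 'timestamp')
dat1[indx][is.na(dat1[indx])] <- -9999
head(dat1)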
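
The `all=TRUE` argument in the `merge` call is what keeps the rows created for the gaps. By default `merge` keeps only the rows whose key appears in both data frames, which is likely why the merge attempted in the question dropped the new rows. A small sketch with made-up values (the names `grid` and `obs` are hypothetical, and UTC is assumed) shows the difference:

```r
# Six half-hourly timestamps; observations exist for only three of them
grid <- data.frame(timestamp = seq(as.POSIXct("1996-01-01", tz = "UTC"),
                                   by = "30 min", length.out = 6))
obs <- data.frame(timestamp = grid$timestamp[c(1, 2, 5)],
                  value1 = c(1.5, 2.5, 3.5))

# Default (all = FALSE): only timestamps present in both frames survive
inner <- merge(obs, grid, by = "timestamp")

# all = TRUE: every timestamp from either frame is kept, gaps become NA
outer <- merge(obs, grid, by = "timestamp", all = TRUE)

nrow(inner)  # 3
nrow(outer)  # 6
```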

Data

set.seed(42)
dat <- data.frame(timestamp= sort(sample(seq(as.POSIXct('1996-01-01'),
    length.out=50, by='30 min'),30, replace=FALSE)), value1=rnorm(30),
    value2=runif(30))
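
Because the question mentions repeating this for other sites and years with different column headers, the same idea can be wrapped in a function that never names the data columns. The following is only a sketch: `fill_gaps` and its arguments are hypothetical names, it uses `match` on the numeric timestamps instead of `merge`, and it assumes every original timestamp falls exactly on the half-hourly grid.

```r
# Hypothetical helper: reindex all non-timestamp columns onto a regular grid.
# Assumes df[[time_col]] is POSIXct and lies exactly on the grid.
fill_gaps <- function(df, time_col = "timestamp", by = "30 min", fill = -9999) {
  grid <- seq(min(df[[time_col]]), max(df[[time_col]]), by = by)
  # Position of each grid timestamp in the original data; NA where it is missing
  idx <- match(as.numeric(grid), as.numeric(df[[time_col]]))
  out <- df[idx, , drop = FALSE]   # NA index yields an all-NA row
  out[[time_col]] <- grid          # restore the complete timestamp column
  value_cols <- setdiff(names(out), time_col)
  out[value_cols][is.na(out[value_cols])] <- fill
  rownames(out) <- NULL
  out
}

# Example data in the same shape as the answer's test data (UTC assumed)
set.seed(42)
dat <- data.frame(
  timestamp = sort(sample(seq(as.POSIXct("1996-01-01", tz = "UTC"),
                              length.out = 50, by = "30 min"), 30)),
  value1 = rnorm(30),
  value2 = runif(30)
)

filled <- fill_gaps(dat)
head(filled)
```

Since the column names are taken from the data frame itself, the same call should work on any file that has a timestamp column, whatever the other headers are.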