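I am using SparkR on time series data and have a question.
After some operations I end up with something like the following, where DayHour holds the day and hour of the value for each ID.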
DayHour ID Value
01 00 4704 10
01 01 4705 11
.
.
.
04 23 4705 12
The problem is that there are gaps; for example, 01 01 and 01 02 are missing here:
DayHour ID Value
01 00 4704 13
01 03 4704 12
I need to fill the gaps across the whole dataset:
DayHour ID Value
01 00 4704 13
01 01 4704 0
01 02 4704 0
01 03 4704 12
For each ID, I have to fill every missing DayHour with a row that carries that ID and Value = 0.
A solution in either R or SparkR would be useful.
Answer 0 (score: 1)
I have the data in a data frame df_r:
>df_r <- data.frame(DayHour=c("01 00","01 01","01 02","01 03","01 06","01 07"),
ID = c(4704,4705,4705,4706,4706,4706),Value=c(10,11,12,13,14,15))
> df_r
DayHour ID Value
1 01 00 4704 10
2 01 01 4705 11
3 01 02 4705 12
4 01 03 4706 13
5 01 06 4706 14
6 01 07 4706 15
Here the hours 01 04 and 01 05 are missing.
# Remove the space so DayHour becomes "DDHH" (e.g. "0100")
df_r$DayHour <- sub(" ", "", df_r$DayHour)
# Build a reference table with every DayHour in sequence (days 01-04, hours 00-23)
x <- 0:23
y <- 1:4
all_day_hour <- data.frame(Hour = rep(x, 4), Day = rep(y, each = 24))
all_day_hour$Hour <- sprintf("%02d", all_day_hour$Hour)
all_day_hour$Day <- sprintf("%02d", all_day_hour$Day)
all_day_hour_1 <- transform(all_day_hour, DayHour = paste0(Day, Hour))
# Keep only the DayHour column
all_day_hour_1 <- all_day_hour_1["DayHour"]
# Loop over each ID and merge its rows with the full DayHour table
library(dplyr)
df.new <- data.frame()
factors <- unique(df_r$ID)
for (i in seq_along(factors)) {
  df_r1 <- filter(df_r, ID == factors[i])
  # Full outer merge: DayHour values missing for this ID come back as NA rows
  df_data1 <- merge(df_r1, all_day_hour_1, by = "DayHour", all = TRUE)
  # Fill the gaps: Value becomes 0 and ID is set to the current ID
  df_data1$Value[is.na(df_data1$Value)] <- 0
  df_data1$ID <- factors[i]
  df.new <- rbind(df.new, df_data1)
}
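The same gap filling can probably be done without a loop. The following is only a sketch in base R, under the assumption that the series spans days 01 to 04 and hours 00 to 23 (the range used above). The names `days`, `hours`, `grid` and `filled` are illustrative and not part of the original answer. It builds every DayHour and ID combination with `expand.grid`, left-joins the observed rows, and sets the missing values to 0. Unlike the loop, it keeps DayHour in its original "DD HH" form.

```r
# Sample data in the same shape as df_r above (DayHour kept as "DD HH").
df_r <- data.frame(
  DayHour = c("01 00", "01 01", "01 02", "01 03", "01 06", "01 07"),
  ID      = c(4704, 4705, 4705, 4706, 4706, 4706),
  Value   = c(10, 11, 12, 13, 14, 15),
  stringsAsFactors = FALSE
)

# Assumption: the series covers days 1-4 and hours 0-23.
days  <- 1:4
hours <- 0:23

# Every (Hour, Day, ID) combination that should exist in the result.
grid <- expand.grid(Hour = hours, Day = days, ID = unique(df_r$ID))
grid$DayHour <- sprintf("%02d %02d", grid$Day, grid$Hour)

# Left join: keep every grid row and attach Value where it was observed.
filled <- merge(grid[c("DayHour", "ID")], df_r,
                by = c("DayHour", "ID"), all.x = TRUE)

# Rows with no observation get Value = 0; ID is already set by the grid.
filled$Value[is.na(filled$Value)] <- 0

# Sort by ID, then by DayHour, for readability.
filled <- filled[order(filled$ID, filled$DayHour), ]
```

Because the ID is part of the grid, the filled rows keep their ID without a separate assignment step. The grid holds one row per ID and hour, so for very large data a distributed join in SparkR would likely be a better fit than this in-memory version.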