I have this table (visit_ts):
Year Month Number_of_visits
2011 4 1
2011 6 3
2011 7 23
2011 12 32
2012 1 123
2012 11 3200
I want to insert rows with Number_of_visits set to 0 for the months that are missing from the table. The following code works correctly:
vec_month <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
vec_year <- c(2011, 2012, 2013, 2014, 2015, 2016)
i <- 1
# First and last year/month present in the table
startyear <- head(visit_ts$Year, n = 1)
endyear <- tail(visit_ts$Year, n = 1)
x <- head(visit_ts$Month, n = 1)
y <- tail(visit_ts$Month, n = 1)
for (year in vec_year) {
  if (year %in% visit_ts$Year) {
    a <- subset(visit_ts, visit_ts$Year == year)
    # Months of this year that have no row yet
    index <- which(!vec_month %in% a$Month)
    for (j in index) {
      if ((year == startyear & j > x) | (year == endyear & j < y)) {
        visit_ts <- rbind(visit_ts, c(year, j, 0))
      } else {
        if (year != startyear & year != endyear)
          visit_ts <- rbind(visit_ts, c(year, j, 0))
      }
    }
  } else {
    i <- i + 1
  }
}
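Since I am new to R, I am looking for an alternative or better solution to this problem that does not involve hard-coding the year and month vectors. Also, please feel free to point out best programming practices.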
Answer 0 (score: 4)
We can use expand.grid with merge or left_join (df1 here is the question's visit_ts table):
library(dplyr)
expand.grid(Year = min(df1$Year):max(df1$Year), Month = 1:12) %>%
  filter(!(Year == min(df1$Year) & Month %in% 1:3 |
           Year == max(df1$Year) & Month == 12)) %>%
  left_join(df1, by = c("Year", "Month")) %>%
  mutate(Number_of_visits = replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
#1 2012 1 123
#2 2012 2 0
#3 2012 3 0
#4 2011 4 1
#5 2012 4 0
#6 2011 5 0
#7 2012 5 0
#8 2011 6 3
#9 2012 6 0
#10 2011 7 23
#11 2012 7 0
#12 2011 8 0
#13 2012 8 0
#14 2011 9 0
#15 2012 9 0
#16 2011 10 0
#17 2012 10 0
#18 2011 11 0
#19 2012 11 3200
#20 2011 12 32
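The filter above still hard-codes the months to drop (1:3 and 12). As a minimal sketch of the merge route mentioned earlier, the bounds can instead be derived from the data by turning each Year/Month pair into a single month index. The names idx, grid, grid_idx and filled are illustrative assumptions, not part of the original answer, and df1 is rebuilt here from the question's table so the block runs on its own.

```r
# Sample data from the question (named df1 to match the answer's code).
df1 <- data.frame(
  Year = c(2011, 2011, 2011, 2011, 2012, 2012),
  Month = c(4, 6, 7, 12, 1, 11),
  Number_of_visits = c(1, 3, 23, 32, 123, 3200)
)

# Encode each (Year, Month) pair as one running month index so the first
# and last observed months can be found without hard-coding them.
idx <- df1$Year * 12 + (df1$Month - 1)

# Every Year/Month combination in the covered years, trimmed to the
# observed range.
grid <- expand.grid(Year = min(df1$Year):max(df1$Year), Month = 1:12)
grid_idx <- grid$Year * 12 + (grid$Month - 1)
grid <- grid[grid_idx >= min(idx) & grid_idx <= max(idx), ]

# Left join in base R: all.x = TRUE keeps grid rows with no match in df1.
filled <- merge(grid, df1, by = c("Year", "Month"), all.x = TRUE)
filled$Number_of_visits[is.na(filled$Number_of_visits)] <- 0
filled <- filled[order(filled$Year, filled$Month), ]
```

Because the range check runs on the combined index, this variant also fills months in any year that lies entirely between the first and last observation.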
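We can make this more dynamic by grouping by 'Year' and building the sequence of 'Month' from its minimum to its maximum as a list column, then unnest it, join with the original dataset (left_join), and replace the NA values with 0.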
library(tidyr)
df1 %>%
  group_by(Year) %>%
  summarise(Month = list(min(Month):max(Month))) %>%
  unnest(Month) %>%
  left_join(df1, by = c("Year", "Month")) %>%
  mutate(Number_of_visits = replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
# <int> <int> <dbl>
#1 2011 4 1
#2 2011 5 0
#3 2011 6 3
#4 2011 7 23
#5 2011 8 0
#6 2011 9 0
#7 2011 10 0
#8 2011 11 0
#9 2011 12 32
#10 2012 1 123
#11 2012 2 0
#12 2012 3 0
#13 2012 4 0
#14 2012 5 0
#15 2012 6 0
#16 2012 7 0
#17 2012 8 0
#18 2012 9 0
#19 2012 10 0
#20 2012 11 3200
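One caveat worth illustrating: the per-year min-to-max sequence only fills gaps inside each year, so months missing across a year boundary (say, data ending in October 2011 and resuming in March 2012) would stay missing. A possible workaround, sketched below under that assumption, is to build one monthly date sequence over the whole range. The data frame df2 is hypothetical data made up for this illustration, and month_start, all_months and filled2 are illustrative names.

```r
library(dplyr)

# Hypothetical data with a gap that spans the year boundary.
df2 <- data.frame(
  Year = c(2011, 2011, 2012, 2012),
  Month = c(4, 10, 3, 11),
  Number_of_visits = c(1, 5, 7, 3200)
)

# Represent each row as the first day of its month.
month_start <- as.Date(sprintf("%d-%02d-01",
                               as.integer(df2$Year), as.integer(df2$Month)))

# One date per calendar month between the first and last observation.
all_months <- seq(min(month_start), max(month_start), by = "month")

# Split the dates back into Year/Month, join, and zero-fill the gaps.
filled2 <- data.frame(
  Year = as.numeric(format(all_months, "%Y")),
  Month = as.numeric(format(all_months, "%m"))
) %>%
  left_join(df2, by = c("Year", "Month")) %>%
  mutate(Number_of_visits = coalesce(Number_of_visits, 0))
```

Here November 2011 through February 2012 come back as zero rows, which the grouped version would not produce for this data.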
Another option is data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'Year', get the sequence from the min to the max of 'Month', join it with the original dataset on 'Year' and 'Month', and replace the NA values with 0.
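The data.table code itself is not included here, so the following is only a sketch of what the description suggests, not the answerer's original snippet. It builds the table directly with data.table() rather than calling setDT(df1), and the names dt, months_by_year and filled_dt are assumptions made for illustration.

```r
library(data.table)

# Sample data from the question, built directly as a data.table.
dt <- data.table(
  Year = c(2011L, 2011L, 2011L, 2011L, 2012L, 2012L),
  Month = c(4L, 6L, 7L, 12L, 1L, 11L),
  Number_of_visits = c(1, 3, 23, 32, 123, 3200)
)

# For each year, the full run of months from its first to its last
# observed month.
months_by_year <- dt[, .(Month = min(Month):max(Month)), by = Year]

# x[i, on = ...] keeps every row of i, so months absent from dt come
# back with NA in Number_of_visits.
filled_dt <- dt[months_by_year, on = c("Year", "Month")]

# Replace the NA counts with 0 by reference.
filled_dt[is.na(Number_of_visits), Number_of_visits := 0]
```

As with the grouped dplyr version, the sequence is built per year, so it only fills gaps that fall inside a single year.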