I have time series for different groups, and some values are missing:
library(tidyverse)
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
# 2006 for USA is missing!
# 2007 for USA is missing!
# 2008 for USA is missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
# 2003 for FRA is missing!
# 2004 for FRA is missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39
# 2011 for FRA is missing!
# 2012 for FRA is missing!
)
When I plot my series, geom_line() connects the points even across years for which I have no observation:
ggplot(df, aes(year, variable, color = country)) +
geom_line()
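"FRA" looks fine because its missing data is at the beginning and the end, but for "USA" I don't want the line to connect across 2006 to 2008.
What I want is the following: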
df <- tribble(
~year, ~country, ~variable,
#--|--|----
2003, "USA", 44,
2004, "USA", 40,
2005, "USA", 30,
2006, "USA", NA, # explicitly missing!
2007, "USA", NA, # explicitly missing!
2008, "USA", NA, # explicitly missing!
2009, "USA", 39,
2010, "USA", 55,
2011, "USA", 53,
2012, "USA", 71,
2003, "FRA", NA, # explicitly missing!
2004, "FRA", NA, # explicitly missing!
2005, "FRA", 10,
2006, "FRA", 8,
2007, "FRA", 13,
2008, "FRA", 12,
2009, "FRA", 18,
2010, "FRA", 39,
2011, "FRA", NA, # explicitly missing!
2012, "FRA", NA # explicitly missing!
)
ggplot(df, aes(year, variable, color = country)) +
geom_line()
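This gives the plot I want, with the line broken at the missing years.
In my real data set I have many groups and dates, so manually inserting NA in the right places is not practical.
I tried merging with a list of the correct dates, but that did not solve it: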
df %>%
right_join(tibble(year = seq(2003, 2012)))
Any ideas?
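Answer 0 (score: 3)
You can use expand.grid to build every year/country combination and then join your data onto it, which creates the missing values automatically:
df2 <- expand.grid(year = unique(df$year), country = unique(df$country)) %>%
  left_join(df, by = c("year", "country"))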
ggplot(df2, aes(year, variable, color = country)) +
geom_line()
df2 will look like this:
year country variable
1 2003 USA 44
2 2004 USA 40
3 2005 USA 30
4 2009 USA 39
5 2010 USA 55
6 2011 USA 53
7 2012 USA 71
8 2006 USA NA
9 2007 USA NA
10 2008 USA NA
11 2003 FRA NA
12 2004 FRA NA
13 2005 FRA 10
14 2009 FRA 18
15 2010 FRA 39
16 2011 FRA NA
17 2012 FRA NA
18 2006 FRA 8
19 2007 FRA 13
20 2008 FRA 12
And the resulting plot has gaps where the data is missing.
Hope this helps!
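One caveat with unique(df$year): a year that is missing for every country would never appear in the grid. As a minimal sketch of an alternative, assuming the tidyr package is available (it is not part of the answer above), complete() together with full_seq() fills in every year between the earliest and latest one. The data frame below is retyped from the question, and df_full is a name chosen here only for illustration.

```r
library(tidyr)
library(ggplot2)

# Data from the question, with the missing years simply absent.
df <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# full_seq() builds every year from the minimum to the maximum in steps of 1.
# complete() then adds a row with NA for each country/year pair not in df.
df_full <- complete(df, country, year = full_seq(year, period = 1))

# geom_line() leaves a gap wherever variable is NA.
p <- ggplot(df_full, aes(year, variable, color = country)) +
  geom_line()
```

Printing p draws the plot. Whether complete() or expand.grid reads better is largely a matter of taste; the main practical difference is that full_seq() does not depend on each year being present in at least one group.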
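Answer 1 (score: 0)
The problem is not with ggplot but with your data. The solution is to do a merge before plotting. First create a data set that contains every combination of year and country.
E.g. all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"))
Then merge your real data set with this complete data set (all_yr). The merge should keep every year and country contained in all_yr; whatever is missing from real_data is filled with NA.
E.g. merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)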
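The merge idea can be sketched in base R as follows. This is only an illustration under stated assumptions: real_data is retyped from the question's data, its columns are assumed to be named year, country and variable, and merged is a name introduced here.

```r
# Data from the question, with the missing years simply absent.
real_data <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Every combination of year and country over the observed range.
all_yr <- expand.grid(
  year    = min(real_data$year):max(real_data$year),
  country = unique(real_data$country),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of all_yr; pairs with no match in
# real_data get NA in the variable column.
merged <- merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)

# Sort by country and year so each line is drawn in chronological order.
merged <- merged[order(merged$country, merged$year), ]
```

Passing merged to ggplot() with geom_line(), as in the question, should then leave a gap for USA between 2005 and 2009.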
Answer 2 (score: 0)
This works for me:
set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
country = c(rep("USA", 7), rep("FR", 6)),
vrbl = rnorm(7+6))
sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)
out <- sapply(sxy, FUN = function(x, mxy) {
out <- merge(x = mxy, y = x, all = TRUE)
out$country <- unique(x$country)
out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)
library(ggplot2)
ggplot(out, aes(x = year, y = vrbl, color = country)) +
theme_bw() +
geom_line()
year country vrbl
FR.1 2003 FR NA
FR.2 2004 FR NA
FR.3 2005 FR 0.22703071
FR.4 2006 FR -0.46901506
FR.5 2007 FR 0.47652129
FR.6 2008 FR -0.91164798
FR.7 2009 FR -0.34177516
FR.8 2010 FR 0.54674134
FR.9 2011 FR NA
FR.10 2012 FR NA
USA.1 2003 USA -1.24111731
USA.2 2004 USA -0.58320499
USA.3 2005 USA 0.39474705
USA.4 2006 USA NA
USA.5 2007 USA NA
USA.6 2008 USA NA
USA.7 2009 USA 1.50421107
USA.8 2010 USA 0.76679974
USA.9 2011 USA 0.31746044
USA.10 2012 USA -0.09997594