Insert NA for missing observations in a time series to get a correct line plot

Asked: 2017-12-12 10:10:14

Tags: r ggplot2 missing-data

I have time series for several groups, and some values are missing, for example:

library(tidyverse)

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  # 2006 for USA is missing!
  # 2007 for USA is missing!
  # 2008 for USA is missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  # 2003 for FRA is missing!
  # 2004 for FRA is missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39
  # 2011 for FRA is missing!
  # 2012 for FRA is missing!
)

When I plot my series, geom_line() connects the points even across years for which I have no observation:

ggplot(df, aes(year, variable, color = country)) +
  geom_line()


"FRA" looks fine because its missing data are at the beginning and the end, but for "USA" I do not want the line to be drawn across 2006 to 2008.

What I want is the following:

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  2003, "USA", 44,
  2004, "USA", 40,
  2005, "USA", 30,
  2006, "USA", NA, # explicitly missing!
  2007, "USA", NA, # explicitly missing!
  2008, "USA", NA, # explicitly missing!
  2009, "USA", 39,
  2010, "USA", 55,
  2011, "USA", 53,
  2012, "USA", 71,
  2003, "FRA", NA, # explicitly missing!
  2004, "FRA", NA, # explicitly missing!
  2005, "FRA", 10,
  2006, "FRA", 8,
  2007, "FRA", 13,
  2008, "FRA", 12,
  2009, "FRA", 18,
  2010, "FRA", 39,
  2011, "FRA", NA, # explicitly missing!
  2012, "FRA", NA # explicitly missing!
)

ggplot(df, aes(year, variable, color = country)) +
  geom_line()

This gives the plot I want: the USA line is broken between 2005 and 2009 instead of being drawn across the gap.

In my real dataset I have many groups and dates, so manually inserting NA in the right places is not an option.

I tried a join against the full list of years, but that did not solve it:

df %>% 
  right_join(tibble(year = seq(2003, 2012)))

Any ideas?

3 Answers:

Answer 0 (score: 3)

You can use expand.grid to create the missing rows in the data frame automatically:

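# stringsAsFactors = FALSE keeps country as character so it matches df$country in the join
df2 <- expand.grid(year = unique(df$year), country = unique(df$country),
                   stringsAsFactors = FALSE) %>%
  left_join(df, by = c("year", "country"))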

ggplot(df2, aes(year, variable, color = country)) +
  geom_line()

df2 will look like this:

   year country variable
1  2003     USA       44
2  2004     USA       40
3  2005     USA       30
4  2009     USA       39
5  2010     USA       55
6  2011     USA       53
7  2012     USA       71
8  2006     USA       NA
9  2007     USA       NA
10 2008     USA       NA
11 2003     FRA       NA
12 2004     FRA       NA
13 2005     FRA       10
14 2009     FRA       18
15 2010     FRA       39
16 2011     FRA       NA
17 2012     FRA       NA
18 2006     FRA        8
19 2007     FRA       13
20 2008     FRA       12

In the resulting plot, the USA line is broken between 2005 and 2009.

Hope this helps!
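The same idea can also be sketched with tidyr::complete(), which is not part of the original answer. The sketch assumes tidyr and ggplot2 are installed; the name df_full is illustrative, and the data are the question's values retyped compactly.

```r
library(tidyr)
library(ggplot2)

# Same values as in the question, written compactly
df <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# full_seq(year, 1) spans min(year) to max(year) in steps of 1, so a year
# that is missing for every country would still get a row
df_full <- complete(df, country, year = full_seq(year, 1))

# geom_line() leaves a gap wherever variable is NA
p <- ggplot(df_full, aes(year, variable, color = country)) +
  geom_line()
print(p)
```

Unlike unique(df$year), full_seq() does not rely on every year appearing for at least one country, which may matter on larger datasets.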

Answer 1 (score: 0)

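The problem is not with ggplot but with your data. The solution is to do a merge before plotting. First create a dataset that contains every combination of year and country.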

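E.g. all_yr <- expand.grid(year = 2000:2010, country = c("CountryA", "CountryB", "CountryZ"), stringsAsFactors = FALSE)

Then merge your real dataset with this complete dataset (all_yr). The merge should keep every year and country contained in all_yr; anything missing from real_data is filled with NA.

E.g. merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)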
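A minimal base-R sketch of this approach on the question's data follows. The names all_yr and real_data come from the answer; merged is an illustrative name, and the values are retyped compactly from the question. It assumes the grid is built with expand.grid and that the merge keys are both year and country, so that each country gets its own row for every year.

```r
# The question's data, standing in for real_data
real_data <- data.frame(
  year     = c(2003:2005, 2009:2012, 2005:2010),
  country  = c(rep("USA", 7), rep("FRA", 6)),
  variable = c(44, 40, 30, 39, 55, 53, 71, 10, 8, 13, 12, 18, 39)
)

# Every combination of year and country
all_yr <- expand.grid(
  year = 2003:2012,
  country = unique(real_data$country),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of all_yr; combinations absent from
# real_data get NA in variable
merged <- merge(all_yr, real_data, by = c("year", "country"), all.x = TRUE)
merged <- merged[order(merged$country, merged$year), ]
```

The merged data frame can then be passed to ggplot() with geom_line() exactly as in the question.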

Answer 2 (score: 0)

This worked for me:

set.seed(357)
xy <- data.frame(year = c(2003:2005, 2009:2012, 2005:2010),
                 country = c(rep("USA", 7), rep("FR", 6)),
                 vrbl = rnorm(7+6))

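# Split by country, then merge each piece with the full range of years
sxy <- split(xy, f = xy$country)
mxy <- data.frame(year = 2003:2012)

out <- sapply(sxy, FUN = function(x, mxy) {
  out <- merge(x = mxy, y = x, all = TRUE)  # years without data get NA
  out$country <- unique(x$country)          # refill country on the new rows
  out
}, mxy = mxy, simplify = FALSE)
out <- do.call(rbind, out)  # stack the per-country pieces into one data frame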

library(ggplot2)

ggplot(out, aes(x = year, y = vrbl, color = country)) +
  theme_bw() +
  geom_line()

The resulting out data frame:

       year country        vrbl
FR.1   2003      FR          NA
FR.2   2004      FR          NA
FR.3   2005      FR  0.22703071
FR.4   2006      FR -0.46901506
FR.5   2007      FR  0.47652129
FR.6   2008      FR -0.91164798
FR.7   2009      FR -0.34177516
FR.8   2010      FR  0.54674134
FR.9   2011      FR          NA
FR.10  2012      FR          NA
USA.1  2003     USA -1.24111731
USA.2  2004     USA -0.58320499
USA.3  2005     USA  0.39474705
USA.4  2006     USA          NA
USA.5  2007     USA          NA
USA.6  2008     USA          NA
USA.7  2009     USA  1.50421107
USA.8  2010     USA  0.76679974
USA.9  2011     USA  0.31746044
USA.10 2012     USA -0.09997594