Connecting points across selected NAs with geom_line()

Asked: 2014-12-28 12:00:31

Tags: r ggplot2

My question is closely related to "Connecting across missing values with geom_line", but it is a follow-up rather than a duplicate.

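My data has missing values coded as NA. The data has been "melted" into long format with reshape2, and I plot it with ggplot2 using geom_point() and geom_line(). The example data contains only one group; my real data contains several. I want to draw a geom_line() that connects data points that have no more than 4 years of missing data between them. In other words, if 3 adjacent rows contain NA, those rows should be dropped (as if na.rm were applied to the data.frame), whereas if at least 4 adjacent rows contain NA, they should be kept so that the line breaks there.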

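Edit: Note that I am reproducing a figure from a book in which the points are connected even where data are missing. Ideally, the segments that bridge missing data would use a different linetype or colour, with a note in the legend explaining it.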

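Below is a very tedious and ugly hack that does not scale to larger amounts of data. I would appreciate a simpler approach, and in particular a simple way to count the runs of consecutive NAs in the data.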
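On the last point, base R's rle() is one way to count runs of consecutive NAs. The following is only a minimal sketch on a made-up vector; `value`, `year`, and `na_runs` are illustrative names and values, not the data from this question.

```r
# Toy input (hypothetical values, not the question's data)
value <- c(0.19, NA, NA, 0.11, NA, NA, NA, NA, 0.12, NA)
year  <- 1919 + seq_along(value)        # 1920, 1921, ..., 1929

# Run-length encode the NA indicator: TRUE runs are runs of missing values
runs   <- rle(is.na(value))
ends   <- cumsum(runs$lengths)          # last position of each run
starts <- ends - runs$lengths + 1       # first position of each run

# One row per run of NAs: where it starts, where it ends, how long it is
na_runs <- data.frame(
  first_year = year[starts[runs$values]],
  last_year  = year[ends[runs$values]],
  n_missing  = runs$lengths[runs$values]
)
print(na_runs)
```

For this toy vector, `na_runs$n_missing` is 2, 4, 1. With several groups, the same computation would have to be run separately within each group, with rows sorted by year.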

### ggplot draws geom_line with NAs

# Data (real-world example, so not exactly MWE)
df <- 
structure(list(Year = c(1910, 1911, 1912, 1913, 1914, 1915, 1916, 
1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 
1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 
1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 
1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 
1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 
1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 
1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 
2005, 2006, 2007, 2008, 2009, 2010), variable = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c("France", "Germany", "Sweden", "Japan"
), class = c("ordered", "factor")), value = c(0.1724, 0.1748, 
0.1752, 0.1777, 0.1778, 0.1953, 0.2132, 0.2242, 0.222, 0.1947, 
NA, NA, NA, NA, NA, 0.113, 0.113, 0.115, 0.112, 0.111, NA, NA, 
0.114, 0.109, 0.113, 0.12, 0.137, 0.15, 0.163, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 0.116, NA, NA, NA, NA, NA, NA, 0.11, 
NA, NA, NA, 0.122, NA, NA, NA, 0.122, NA, NA, 0.112, NA, NA, 
0.113, NA, NA, 0.101, NA, NA, 0.102, NA, NA, 0.1043, NA, NA, 
0.0906, NA, NA, 0.0964, NA, NA, 0.1052, NA, NA, 0.1043, NA, NA, 
0.1005, NA, NA, 0.1088, NA, NA, 0.101139312657167, 0.0950290025146689, 
0.0901042749371333, 0.09, 0.107249622799665, 0.108891198658843, 
0.115913495389774, 0.110684772282761, 0.113299133836267, 0.111991953059514
)), .Names = c("Year", "variable", "value"), row.names = 102:202, class = "data.frame")

Default plot:

library("ggplot2")
ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line()

Plot with all NAs removed (see "Connecting across missing values with geom_line"):

ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line(data = df[!is.na(df$value), ])

Desired plot:

df2 <- df
df2[df2$Year == 1922, ]$value <- "-999999"
df2[df2$Year == 1948, ]$value <- "-999999"
df2 <- df2[!is.na(df2$value), ]
df2$value <- as.numeric(df2$value)
ggplot(data = df2, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line() + scale_y_continuous(limits = c(.08, .23))

1 answer:

Answer 0 (score: 3)

This produces your "desired plot", with the exceptions noted in the comments.

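# Run-length encode: TRUE where value is present, FALSE where it is NA
x <- rle(!is.na(df$value))
# Flip NA runs longer than 3 to TRUE so those rows are kept and the line breaks there
x$values[which(x$lengths>3 & !x$values)] <- TRUE
# Expand back to one logical per row of df
indx <- inverse.rle(x)
library(ggplot2)
ggplot(df[indx,],aes(x=Year,y=value,color=variable))+
  geom_point(size=3)+
  geom_line()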

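Basically, we encode NA as FALSE and everything else as TRUE, then run-length encode the result to identify the runs of TRUE and FALSE. Any run of FALSE with length > 3 should be kept, so we convert those runs to TRUE (as if they were not NA). We then use inverse.rle() to recover an index vector that is TRUE for every row that should be kept. Finally, we use it to subset df for ggplot. Short runs of NA are dropped, so geom_line() connects across them, while long runs stay in the data and break the line.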
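The question mentions several groups in the real data, and the snippet above encodes the whole value column at once. One possible extension is to wrap the same idea in a function and apply it per group. The following is a sketch under assumptions: `keep_rows`, its `max_gap` parameter, and the `toy` data frame are hypothetical names introduced here, and rows are assumed to be sorted by Year within each group.

```r
# Hypothetical helper: returns TRUE for rows to keep before plotting.
# NA runs of at most max_gap rows are dropped; longer NA runs are kept.
keep_rows <- function(value, max_gap = 3) {
  r <- rle(!is.na(value))
  r$values[!r$values & r$lengths > max_gap] <- TRUE
  inverse.rle(r)
}

# Toy data with two groups (made-up values)
toy <- data.frame(
  Year     = rep(2000:2007, times = 2),
  variable = rep(c("A", "B"), each = 8),
  value    = c(1, NA, NA, 2, NA, NA, NA, NA,   # group A
               NA, 3, NA, NA, NA, NA, 4, NA)   # group B
)

# ave() applies keep_rows within each group; it returns 0/1 because the
# input column is numeric, so convert back to logical
keep <- as.logical(ave(toy$value, toy$variable, FUN = keep_rows))
toy_plot <- toy[keep, ]
```

`toy_plot` can then be passed to ggplot() in place of `df[indx,]`, with `group = variable` in aes() so that each group gets its own line.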