Expand rows to fill in missing observations

Time: 2018-12-07 01:01:31

Tags: r

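I have a dataset that was collected using 100 m transects. During data collection, if nothing was detected along the entire transect, a single "0" was recorded. If something was detected, a "1" was recorded at each 20 m interval along the transect (20, 40, 60, 80, 100) where the detection occurred. For example: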

Location Year Month Visit Air.Temp Transect Distance Present
Site1    2015 Feb   1     22.5      A       20       1
Site1    2015 Feb   1     22.5      A       40       1
Site1    2015 Feb   1     22.5      A       80       1
Site1    2015 Feb   1     23.0      B       20       1
Site1    2015 Feb   1     21.5      C       100      0
Site1    2015 Feb   2     24.0      A       80       1
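
To make the example reproducible, the sample rows above can be read into a data frame. This is only a sketch: the object name `df` is an assumption (chosen to match the name used in the answer), and the real dataset has more columns and rows.

```r
# Sample data copied from the table above.
# `df` is an assumed name; stringsAsFactors = FALSE keeps text columns as character.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Location Year Month Visit Air.Temp Transect Distance Present
Site1    2015 Feb   1     22.5      A       20       1
Site1    2015 Feb   1     22.5      A       40       1
Site1    2015 Feb   1     22.5      A       80       1
Site1    2015 Feb   1     23.0      B       20       1
Site1    2015 Feb   1     21.5      C       100      0
Site1    2015 Feb   2     24.0      A       80       1
")

# Inspect the column types before reshaping
str(df)
```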

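I would like to expand my dataset so that it contains one row for every 20 m interval that was searched, adding a "0" at the distances where nothing was recorded, while keeping the data associated with that particular transect (e.g. site, year, month, visit, temperature, etc.). For example, my desired output for the data above is: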

Location Year Month Visit Air.Temp Transect Distance Present
Site1    2015 Feb   1     22.5      A       20       1
Site1    2015 Feb   1     22.5      A       40       1
Site1    2015 Feb   1     22.5      A       60       0
Site1    2015 Feb   1     22.5      A       80       1
Site1    2015 Feb   1     22.5      A       100      0
Site1    2015 Feb   1     23.0      B       20       1
Site1    2015 Feb   1     23.0      B       40       0
Site1    2015 Feb   1     23.0      B       60       0
Site1    2015 Feb   1     23.0      B       80       0
Site1    2015 Feb   1     23.0      B       100      0
Site1    2015 Feb   1     21.5      C       20       0
Site1    2015 Feb   1     21.5      C       40       0
Site1    2015 Feb   1     21.5      C       60       0
Site1    2015 Feb   1     21.5      C       80       0
Site1    2015 Feb   1     21.5      C       100      0
Site1    2015 Feb   2     24.0      A       20       0
Site1    2015 Feb   2     24.0      A       40       0
Site1    2015 Feb   2     24.0      A       60       0
Site1    2015 Feb   2     24.0      A       80       1
Site1    2015 Feb   2     24.0      A       100      0

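I tried an expand.grid approach that had been suggested for a similar question, but in my case it threw a memory error because the data frame it tried to generate was too large (in reality, my dataset has many more measured variables and > 1000 rows).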

Any help would be much appreciated! Thank you.

1 Answer:

Answer 0 (score: 0)

Option 1: nest the data and use full_join

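library(tidyverse)

# Skeleton holding every 20 m interval along a transect
skel <- data.frame(Distance = seq(20, 100, 20))

df %>%
    group_by_at(vars(Location:Transect)) %>%   # one group per transect visit
    nest() %>%
    # add the missing distances to each transect's rows
    mutate(data = map(data, ~full_join(.x, skel, by = "Distance"))) %>%
    unnest(data) %>%
    replace_na(list(Present = 0))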
## A tibble: 20 x 8
#   Location  Year Month Visit Air.Temp Transect Distance Present
#   <fct>    <int> <fct> <int>    <dbl> <fct>       <dbl>   <dbl>
# 1 Site1     2015 Feb       1     22.5 A              20       1
# 2 Site1     2015 Feb       1     22.5 A              40       1
# 3 Site1     2015 Feb       1     22.5 A              80       1
# 4 Site1     2015 Feb       1     22.5 A              60       0
# 5 Site1     2015 Feb       1     22.5 A             100       0
# 6 Site1     2015 Feb       1     23   B              20       1
# 7 Site1     2015 Feb       1     23   B              40       0
# 8 Site1     2015 Feb       1     23   B              60       0
# 9 Site1     2015 Feb       1     23   B              80       0
#10 Site1     2015 Feb       1     23   B             100       0
#11 Site1     2015 Feb       1     21.5 C             100       0
#12 Site1     2015 Feb       1     21.5 C              20       0
#13 Site1     2015 Feb       1     21.5 C              40       0
#14 Site1     2015 Feb       1     21.5 C              60       0
#15 Site1     2015 Feb       1     21.5 C              80       0
#16 Site1     2015 Feb       2     24   A              80       1
#17 Site1     2015 Feb       2     24   A              20       0
#18 Site1     2015 Feb       2     24   A              40       0
#19 Site1     2015 Feb       2     24   A              60       0
#20 Site1     2015 Feb       2     24   A             100       0
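
Note that the join appends the added distances after each transect's existing rows, so the result is not ordered by Distance within a transect. A minimal sketch of sorting afterwards with `arrange()`, assuming the sample data from the question is stored in `df` (the grouping columns are listed explicitly here; `out` is just an illustrative name):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data from the question (object name `df` is an assumption)
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Location Year Month Visit Air.Temp Transect Distance Present
Site1    2015 Feb   1     22.5      A       20       1
Site1    2015 Feb   1     22.5      A       40       1
Site1    2015 Feb   1     22.5      A       80       1
Site1    2015 Feb   1     23.0      B       20       1
Site1    2015 Feb   1     21.5      C       100      0
Site1    2015 Feb   2     24.0      A       80       1
")

# Every 20 m interval, as integers to match the Distance column
skel <- data.frame(Distance = seq(20L, 100L, by = 20L))

out <- df %>%
    group_by(Location, Year, Month, Visit, Air.Temp, Transect) %>%
    nest() %>%
    ungroup() %>%
    # join each transect's rows against the full set of distances
    mutate(data = map(data, ~ full_join(.x, skel, by = "Distance"))) %>%
    unnest(data) %>%
    # distances that were absent from the original data get Present = 0
    replace_na(list(Present = 0L)) %>%
    # order rows by transect visit, then by distance
    arrange(Location, Year, Month, Visit, Transect, Distance)

print(out)
```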

Option 2: use tidyr::complete

library(tidyverse)
df %>%
    group_by_at(vars(Location:Transect)) %>%
    mutate(Distance = factor(Distance, levels = seq(20, 100, 20))) %>%
    complete(Distance, fill = list(Present = 0)) %>%
    mutate(Distance = as.integer(as.character(Distance)))
## A tibble: 20 x 8
## Groups:   Location, Year, Month, Visit, Air.Temp, Transect [4]
#   Location  Year Month Visit Air.Temp Transect Distance Present
#   <fct>    <int> <fct> <int>    <dbl> <fct>       <int>   <dbl>
# 1 Site1     2015 Feb       1     21.5 C              20       0
# 2 Site1     2015 Feb       1     21.5 C              40       0
# 3 Site1     2015 Feb       1     21.5 C              60       0
# 4 Site1     2015 Feb       1     21.5 C              80       0
# 5 Site1     2015 Feb       1     21.5 C             100       0
# 6 Site1     2015 Feb       1     22.5 A              20       1
# 7 Site1     2015 Feb       1     22.5 A              40       1
# 8 Site1     2015 Feb       1     22.5 A              60       0
# 9 Site1     2015 Feb       1     22.5 A              80       1
#10 Site1     2015 Feb       1     22.5 A             100       0
#11 Site1     2015 Feb       1     23   B              20       1
#12 Site1     2015 Feb       1     23   B              40       0
#13 Site1     2015 Feb       1     23   B              60       0
#14 Site1     2015 Feb       1     23   B              80       0
#15 Site1     2015 Feb       1     23   B             100       0
#16 Site1     2015 Feb       2     24   A              20       0
#17 Site1     2015 Feb       2     24   A              40       0
#18 Site1     2015 Feb       2     24   A              60       0
#19 Site1     2015 Feb       2     24   A              80       1
#20 Site1     2015 Feb       2     24   A             100       0

The drawback of Option 2 is that we need to convert Distance to a factor and then back to integer (or numeric).
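
If that round trip is unwanted, one possible variation is to pass the full set of distances to `complete()` directly, which it accepts as a named argument. The following is a sketch rather than part of the original answer; it assumes the sample data from the question is stored in `df`, and `out` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (object name `df` is an assumption)
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Location Year Month Visit Air.Temp Transect Distance Present
Site1    2015 Feb   1     22.5      A       20       1
Site1    2015 Feb   1     22.5      A       40       1
Site1    2015 Feb   1     22.5      A       80       1
Site1    2015 Feb   1     23.0      B       20       1
Site1    2015 Feb   1     21.5      C       100      0
Site1    2015 Feb   2     24.0      A       80       1
")

out <- df %>%
    group_by(Location, Year, Month, Visit, Air.Temp, Transect) %>%
    # supply the expected distances explicitly instead of relying on factor levels
    complete(Distance = seq(20L, 100L, by = 20L),
             fill = list(Present = 0L)) %>%
    ungroup()

print(out)
```

Because only the five distances are expanded within each transect group, the result grows to five rows per transect visit instead of a full cross of every column, which is what tends to exhaust memory with expand.grid.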