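I have a dataset that was collected along 100 m transects. During data collection, if nothing was detected along the entire transect, a "0" was recorded. If something was detected, a "1" was recorded at the 20 m interval (20, 40, 60, 80, 100) along the transect where it occurred. For example: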
Location Year Month Visit Air.Temp Transect Distance Present
Site1 2015 Feb 1 22.5 A 20 1
Site1 2015 Feb 1 22.5 A 40 1
Site1 2015 Feb 1 22.5 A 80 1
Site1 2015 Feb 1 23.0 B 20 1
Site1 2015 Feb 1 21.5 C 100 0
Site1 2015 Feb 2 24.0 A 80 1
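For anyone who wants to reproduce this, the sample above can be typed in as an R data frame. This is a minimal sketch: the name `df` and the column types (integers for counts and distances, character for labels) are assumptions, not something stated in the data.

```r
# Sample data from the table above.
# The name `df` and the column types are assumptions made for illustration.
df <- data.frame(
  Location = "Site1",   # length-1 values are recycled to all six rows
  Year     = 2015L,
  Month    = "Feb",
  Visit    = c(1L, 1L, 1L, 1L, 1L, 2L),
  Air.Temp = c(22.5, 22.5, 22.5, 23.0, 21.5, 24.0),
  Transect = c("A", "A", "A", "B", "C", "A"),
  Distance = c(20L, 40L, 80L, 20L, 100L, 80L),
  Present  = c(1L, 1L, 1L, 1L, 0L, 1L)
)

str(df)
```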
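I want to expand my dataset so that it contains one row for every 20 m interval searched, adding a "0" at the distances where nothing was recorded, while keeping the data associated with that particular transect (e.g. site, year, month, visit, temperature, etc.). For example, my desired output for the data above is:
Location Year Month Visit Air.Temp Transect Distance Present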
Site1 2015 Feb 1 22.5 A 20 1
Site1 2015 Feb 1 22.5 A 40 1
Site1 2015 Feb 1 22.5 A 60 0
Site1 2015 Feb 1 22.5 A 80 1
Site1 2015 Feb 1 22.5 A 100 0
Site1 2015 Feb 1 23.0 B 20 1
Site1 2015 Feb 1 23.0 B 40 0
Site1 2015 Feb 1 23.0 B 60 0
Site1 2015 Feb 1 23.0 B 80 0
Site1 2015 Feb 1 23.0 B 100 0
Site1 2015 Feb 1 21.5 C 20 0
Site1 2015 Feb 1 21.5 C 40 0
Site1 2015 Feb 1 21.5 C 60 0
Site1 2015 Feb 1 21.5 C 80 0
Site1 2015 Feb 1 21.5 C 100 0
Site1 2015 Feb 2 24.0 A 20 0
Site1 2015 Feb 2 24.0 A 40 0
Site1 2015 Feb 2 24.0 A 60 0
Site1 2015 Feb 2 24.0 A 80 1
Site1 2015 Feb 2 24.0 A 100 0
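I tried an expand.grid approach, which has been suggested for similar problems, but in my case it threw a memory error because the data frame it tried to generate was too large (my actual dataset has many more columns of measured variables and > 1000 rows).
Any help would be much appreciated! Thanks.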
Answer 0: (score: 0)
Option 1: Using full_join on nested data
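library(tidyverse)

# Skeleton holding every distance each transect should have
skel <- data.frame(Distance = seq(20, 100, 20))

df %>%
    group_by_at(vars(Location:Transect)) %>%   # one group per transect visit
    nest() %>%
    mutate(data = map(data, ~full_join(.x, skel, by = "Distance"))) %>%
    unnest(cols = data) %>%                    # recent tidyr versions require `cols`
    replace_na(list(Present = 0))              # distances added from skel had no record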
## A tibble: 20 x 8
# Location Year Month Visit Air.Temp Transect Distance Present
# <fct> <int> <fct> <int> <dbl> <fct> <dbl> <dbl>
# 1 Site1 2015 Feb 1 22.5 A 20 1
# 2 Site1 2015 Feb 1 22.5 A 40 1
# 3 Site1 2015 Feb 1 22.5 A 80 1
# 4 Site1 2015 Feb 1 22.5 A 60 0
# 5 Site1 2015 Feb 1 22.5 A 100 0
# 6 Site1 2015 Feb 1 23 B 20 1
# 7 Site1 2015 Feb 1 23 B 40 0
# 8 Site1 2015 Feb 1 23 B 60 0
# 9 Site1 2015 Feb 1 23 B 80 0
#10 Site1 2015 Feb 1 23 B 100 0
#11 Site1 2015 Feb 1 21.5 C 100 0
#12 Site1 2015 Feb 1 21.5 C 20 0
#13 Site1 2015 Feb 1 21.5 C 40 0
#14 Site1 2015 Feb 1 21.5 C 60 0
#15 Site1 2015 Feb 1 21.5 C 80 0
#16 Site1 2015 Feb 2 24 A 80 1
#17 Site1 2015 Feb 2 24 A 20 0
#18 Site1 2015 Feb 2 24 A 40 0
#19 Site1 2015 Feb 2 24 A 60 0
#20 Site1 2015 Feb 2 24 A 100 0
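In the output above, the rows that full_join adds are appended after the recorded ones, so distances are not in order within a transect (80 comes before 60 for transect A). If order matters, a final `arrange()` should restore it. The sketch below uses a small stand-in data frame, hypothetically named `res`, that copies only the transect A rows of visit 1 from that output.

```r
library(dplyr)

# Hypothetical stand-in for part of the result above (transect A, visit 1),
# in the same unsorted order that full_join produced.
res <- data.frame(
  Transect = "A",
  Distance = c(20, 40, 80, 60, 100),
  Present  = c(1, 1, 1, 0, 0)
)

# Sort by the transect key(s) first, then by distance
sorted <- res %>%
  arrange(Transect, Distance)

sorted
```

On the full result, the same call would list every key column (Location, Year, Month, Visit, Transect) before Distance.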
Option 2: Using tidyr::complete
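library(tidyverse)

df %>%
    group_by_at(vars(Location:Transect)) %>%
    # A factor with all five levels tells complete() which distances must exist
    mutate(Distance = factor(Distance, levels = seq(20, 100, 20))) %>%
    complete(Distance, fill = list(Present = 0)) %>%
    # Convert the factor labels back to numbers
    mutate(Distance = as.integer(as.character(Distance)))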
## A tibble: 20 x 8
## Groups: Location, Year, Month, Visit, Air.Temp, Transect [4]
# Location Year Month Visit Air.Temp Transect Distance Present
# <fct> <int> <fct> <int> <dbl> <fct> <int> <dbl>
# 1 Site1 2015 Feb 1 21.5 C 20 0
# 2 Site1 2015 Feb 1 21.5 C 40 0
# 3 Site1 2015 Feb 1 21.5 C 60 0
# 4 Site1 2015 Feb 1 21.5 C 80 0
# 5 Site1 2015 Feb 1 21.5 C 100 0
# 6 Site1 2015 Feb 1 22.5 A 20 1
# 7 Site1 2015 Feb 1 22.5 A 40 1
# 8 Site1 2015 Feb 1 22.5 A 60 0
# 9 Site1 2015 Feb 1 22.5 A 80 1
#10 Site1 2015 Feb 1 22.5 A 100 0
#11 Site1 2015 Feb 1 23 B 20 1
#12 Site1 2015 Feb 1 23 B 40 0
#13 Site1 2015 Feb 1 23 B 60 0
#14 Site1 2015 Feb 1 23 B 80 0
#15 Site1 2015 Feb 1 23 B 100 0
#16 Site1 2015 Feb 2 24 A 20 0
#17 Site1 2015 Feb 2 24 A 40 0
#18 Site1 2015 Feb 2 24 A 60 0
#19 Site1 2015 Feb 2 24 A 80 1
#20 Site1 2015 Feb 2 24 A 100 0
The drawback of option 2 is that we need to convert Distance to a factor and then back to integer (or numeric).
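A possible way around the factor round trip, offered as a sketch rather than as part of the original answer: `tidyr::complete()` also accepts a named vector of the values a column should take, so the distances can be supplied directly. The data frame `df_small` below is a hypothetical subset of the question's sample (transects A and C of visit 1, with fewer key columns) so that the example stays self-contained.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the question's sample: transects A and C, visit 1
df_small <- data.frame(
  Location = "Site1",
  Transect = c("A", "A", "A", "C"),
  Distance = c(20L, 40L, 80L, 100L),
  Present  = c(1L, 1L, 1L, 0L)
)

out <- df_small %>%
  group_by(Location, Transect) %>%
  # Supply the full set of distances directly instead of going through a factor
  complete(Distance = seq(20L, 100L, by = 20L), fill = list(Present = 0L)) %>%
  ungroup()

out
```

With the real data, the grouping would include every column that identifies a transect visit, as in the two options above.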