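Every five days I collect data on plant development, or phenology (encoded with the categorical variable 'Code'), along a transect divided into 78 contiguous segments. Every species is surveyed in every segment of the transect. This effort repeats a study carried out 100 years ago!
I want to recode my dataset to overcome the shortcomings of the original study's coding system.
The original coding system (for plant flowering phases):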
K = flower bud
b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
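b4 = flowering ended
The problem is that, when I want to analyze my data, these codes do not adequately describe the context of an observation. For example, the codes 'b1' and 'b2' can occur both early and late in the flowering period. This makes it hard to "order" my observations in a standardized way.
A solution could be a loop, or some other efficient approach, that moves sequentially through the observations (by 'Segment', 'Species', 'Date') and recodes each one based on whether it occurs before or after a particular event (in this case, the first time 'Code' is recorded as 'b3').
For any given segment of the transect and species, the codes in the raw data might look like this: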
Date Segment Species Code
26/05/2017 1 A K
01/06/2017 1 A b1
06/06/2017 1 A b1
10/06/2017 1 A b2
14/06/2017 1 A b2
19/06/2017 1 A b2
23/06/2017 1 A b3
28/06/2017 1 A b3
03/07/2017 1 A b2
08/07/2017 1 A b2
14/07/2017 1 A b1
19/07/2017 1 A b4
If I account for where in the season an observation falls, I would use a coding system like this:
K = flower bud
b1a = single flower
b2a = sparse flowers (two or three)
b3 = flowers common (more than three)
b2b = sparse flowers (two or three)
b1b = single flower
b4 = flowering ended
With these changes to the codes, the example data above would look like this:
Date Segment Species Code
26/05/2017 1 A K
01/06/2017 1 A b1a
06/06/2017 1 A b1a
10/06/2017 1 A b2a
14/06/2017 1 A b2a
19/06/2017 1 A b2a
23/06/2017 1 A b3
28/06/2017 1 A b3
03/07/2017 1 A b2b
08/07/2017 1 A b2b
14/07/2017 1 A b1b
19/07/2017 1 A b4
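In addition, I have to recode the historical dataset, so it is essential that any solution works for both datasets.
Note: it is very important that the suffix appended to 'b1' or 'b2' switches from 'a' to 'b' after the first time 'b3' is encountered, and never switches back. This matters because flower abundance sometimes fluctuates during the growing season. For example: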
Date Segment Species Code
01-Jun-17 1 A b1
06-Jun-17 1 A b1
10-Jun-17 1 A b2
14-Jun-17 1 A b2
19-Jun-17 1 A b3
23-Jun-17 1 A b3
28-Jun-17 1 A b2 # appears out of the "ideal" sequence
02-Aug-17 1 A b3
07-Aug-17 1 A b2 # appears out of the "ideal" sequence
12-Aug-17 1 A b3
17-Aug-17 1 A b2
22-Aug-17 1 A b1 # appears out of the "ideal" sequence
27-Aug-17 1 A b2
02-Sep-17 1 A b1
07-Sep-17 1 A b4
In that case, the recoded data would look like this:
Date Segment Species Code
01-Jun-17 1 A b1a
06-Jun-17 1 A b1a
10-Jun-17 1 A b2a
14-Jun-17 1 A b2a
19-Jun-17 1 A b3
23-Jun-17 1 A b3
28-Jun-17 1 A b2b
02-Aug-17 1 A b3
07-Aug-17 1 A b2b
12-Aug-17 1 A b3
17-Aug-17 1 A b2b
22-Aug-17 1 A b1b
27-Aug-17 1 A b2b
02-Sep-17 1 A b1b
07-Sep-17 1 A b4
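One last point. Because the growing season in the Arctic is short, not every flowering phase (= code) occurs for every species within a segment.
Example data: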
DT <- structure(list(Date = structure(c(17312, 17318, 17323, 17327,
17331, 17336, 17340, 17345, 17350, 17355, 17361, 17366, 17312,
17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350, 17355,
17361, 17366, 17370, 17375, 17350, 17355, 17361, 17366, 17370,
17312, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 17350,
17355, 17361, 17366, 17312, 17318, 17323, 17327, 17331, 17336,
17340, 17345, 17350, 17355, 17361, 17366, 17355, 17361, 17366,
17370, 17375, 17318, 17323, 17327, 17331, 17336, 17340, 17345,
17380, 17385, 17390, 17395, 17400, 17405, 17411, 17416, 17318,
17323, 17327, 17331, 17336, 17340, 17345, 17380, 17385, 17390,
17395, 17400, 17405, 17411, 17416, 17318, 17323, 17327, 17331,
17336, 17340, 17345, 17380, 17385, 17390, 17395, 17400, 17405,
17411, 17416), class = "Date"), Segment = c(1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), Species = c("A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C",
"C", "C", "C", "C", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "C", "C", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
), Code = c("K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2",
"b2", "b1", "b4", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3",
"b2", "b2", "b2", "b1", "b1", "b4", "b1", "b1", "b2", "b2", "b4",
"b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b2", "b2", "b2",
"b4", "K", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2",
"b2", "b4", "b3", "b3", "b2", "b1", "b4", "b1", "b1", "b2", "b2",
"b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4",
"b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3", "b2", "b3", "b2",
"b1", "b2", "b1", "b4", "b1", "b1", "b2", "b2", "b3", "b3", "b2",
"b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4")), .Names = c("Date",
"Segment", "Species", "Code"), row.names = c(NA, -105L), class = c("data.table",
"data.frame"))
Answer 0 (score: 2)
With dplyr, this can be done as follows:
library(dplyr)
DT %>%
  group_by(Species, Segment) %>%
  mutate(after_b3 = (cumsum(Code == "b3") > 0),
         Code_new = case_when(Code %in% c("b1", "b2") & !after_b3 ~ paste0(Code, "a"),
                              Code %in% c("b1", "b2") & after_b3 ~ paste0(Code, "b"),
                              TRUE ~ Code))
# A tibble: 105 x 6
# Groups: Segment, Species [9]
# Date Segment Species Code after_b3 Code_new
# <date> <dbl> <chr> <chr> <lgl> <chr>
# 1 2017-05-26 1 A K FALSE K
# 2 2017-06-01 1 A b1 FALSE b1a
# 3 2017-06-06 1 A b1 FALSE b1a
# 4 2017-06-10 1 A b2 FALSE b2a
# 5 2017-06-14 1 A b2 FALSE b2a
# 6 2017-06-19 1 A b2 FALSE b2a
# 7 2017-06-23 1 A b3 TRUE b3
# 8 2017-06-28 1 A b3 TRUE b3
# 9 2017-07-03 1 A b2 TRUE b2b
# 10 2017-07-08 1 A b2 TRUE b2b
# ... with 95 more rows
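With group_by, the code is applied to each Segment, Species combination. The after_b3 column records whether Code has already been "b3" at least once within that combination. Code_new is then determined by checking a few cases: b1 and b2 get the suffix "a" before the first b3 and "b" from then on, and every other code is left unchanged.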
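To make the role of the running flag concrete, here is a minimal sketch in base R on made-up toy data (the vectors `grp` and `code` are illustrative and not part of the question's dataset). It uses `ave()` as a stand-in for `group_by()` plus `mutate()`, and shows that the flag stays TRUE once a b3 has been seen, even when later codes fluctuate, and stays FALSE in a group that never reaches b3.

```r
# Toy data in date order: group "A" fluctuates after its first b3,
# group "C" never reaches b3 (both groups are made up for illustration)
grp  <- c(rep("A", 6), rep("C", 3))
code <- c("b1", "b2", "b3", "b2", "b3", "b1", "b1", "b2", "b4")

# Running count of "b3" within each group; > 0 means b3 has been seen
after_b3 <- ave(as.integer(code == "b3"), grp, FUN = cumsum) > 0

print(data.frame(grp, code, after_b3))
```

Because the count never decreases, a later drop back to b2 or b1 cannot reset the flag, which is the behaviour the question asks for.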
Answer 1 (score: 0)
Maybe not the most efficient way, but it works (assuming I understood your question correctly):
library(data.table)
DT <- as.data.table(DT)
tmp_list <- list()
for (seg in unique(DT$Segment)) {
  for (spec in unique(DT$Species)) {
    # Subset one Segment/Species combination (rows assumed to be in date order)
    sub <- DT[Segment %in% seg & Species %in% spec]
    index <- which(sub$Code == "b3")[1]  # row of the first b3, NA if there is none
    rows <- nrow(sub)
    if (!is.na(index)) {
      # From the first b3 onward: b1 -> b1b, b2 -> b2b
      sub[index:rows, new_code := ifelse(Code %in% "b1", "b1b",
                                  ifelse(Code %in% "b2", "b2b", Code))]
      # Up to the first b3: b1 -> b1a, b2 -> b2a
      sub[1:index, new_code := ifelse(Code %in% "b1", "b1a",
                               ifelse(Code %in% "b2", "b2a", Code))]
    } else {
      # No b3 in this combination: treat every row as before the first b3
      sub[, new_code := ifelse(Code %in% "b1", "b1a",
                        ifelse(Code %in% "b2", "b2a", Code))]
    }
    tmp_list[[paste0(seg, "_", spec)]] <- sub
  }
}
final <- rbindlist(tmp_list)
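So, for each Segment and Species combination, I find the first b3. From there on (by "after" I mean the following rows), I change all b1 and b2 to b1b and b2b respectively. For the rows before the first b3, I change b1 and b2 to b1a and b2a respectively. The if statement handles the case where a particular Species/Segment combination has no b3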
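Both approaches above treat "first b3" as first in row order, so they depend on each group already being sorted by Date. Since the same recoding also has to be applied to the historical dataset, one option is to wrap the rule in a small function that sorts first. The following is a sketch only: `recode_phase` is a hypothetical helper name, the toy data frame is made up, and the function assumes the columns are named Date, Segment, Species, and Code, with Code stored as character rather than factor.

```r
# Hypothetical helper: recode b1/b2 relative to the first b3 in each
# Segment/Species group. Assumes columns Date, Segment, Species, Code
# and that Code is a character vector.
recode_phase <- function(d) {
  # Sort so that "first b3" means first by date within each group
  d <- d[order(d$Segment, d$Species, d$Date), ]
  grp <- interaction(d$Segment, d$Species, drop = TRUE)
  # TRUE from the first b3 onward; stays FALSE in groups with no b3
  after_b3 <- ave(as.integer(d$Code == "b3"), grp, FUN = cumsum) > 0
  is_b12 <- d$Code %in% c("b1", "b2")
  d$Code_new <- d$Code
  d$Code_new[is_b12] <- paste0(d$Code[is_b12],
                               ifelse(after_b3[is_b12], "b", "a"))
  d
}

# Small made-up example, deliberately out of date order
toy <- data.frame(
  Date    = as.Date(c("2017-06-10", "2017-06-01", "2017-06-06",
                      "2017-06-01", "2017-06-06")),
  Segment = c(1, 1, 1, 2, 2),
  Species = c("A", "A", "A", "A", "A"),
  Code    = c("b2", "b1", "b3", "b1", "b4"),
  stringsAsFactors = FALSE
)
res <- recode_phase(toy)
print(res)
```

The same function could then be called once on the current data and once on the historical data, provided both use the same column names.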