R: how to remove out-of-sequence values

Time: 2018-01-29 08:36:58

Tags: r dataframe datatable

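As described in a previous question, I collect plant development (phenology) data every five days, recorded as coded categorical variables, along a transect divided into 78 consecutive segments. Every species is surveyed in each segment of the transect.

Another problem I did not consider when collecting the data is that observers sometimes miss an observation in the field, which affects the code they choose, or they simply mistype it. Specifically, the codes they use are: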

b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended

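The expected (simplified) sequence of observations over time looks like 'b1', 'b2', 'b3', 'b2', 'b1', 'b4'. Note that several sample dates may share the same observation, so the data may look like 'b1', 'b1', 'b2', 'b3', 'b3', 'b2', 'b2', 'b2', 'b1', 'b1', 'b4'.

Unfortunately, I have found many examples of sequences that look like this:
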
Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b2 # out of sequence - assume it should be b3
02-Aug-17   1   A   b3
07-Aug-17   1   A   b2 # out of sequence - assume it should be b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
22-Aug-17   1   A   b1 # out of sequence - assume it should be b2
27-Aug-17   1   A   b2 
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

These should look like:

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
02-Aug-17   1   A   b3
07-Aug-17   1   A   b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
22-Aug-17   1   A   b2
27-Aug-17   1   A   b2
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

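A more rigorously honest approach would be to discard the first out-of-sequence value, since we cannot know whether the observer missed a flowering plant or the value is a typo in the data set. So, how do I remove the first out-of-sequence value each time a sequence error occurs? In that case, the data set would look like: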

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b3
23-Jun-17   1   A   b3
02-Aug-17   1   A   b3
12-Aug-17   1   A   b3
17-Aug-17   1   A   b2
27-Aug-17   1   A   b2
02-Sep-17   1   A   b1
07-Sep-17   1   A   b4

Here is the sample data:

Test.Data <- structure(list(Date = structure(c(17318, 17323, 17327,
17331, 17336, 17340, 17345, 17380, 17385, 17390, 17395, 17400, 17405, 
17411, 17416, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 
17380, 17385, 17390, 17395, 17400, 17405, 17411, 17416, 17318, 
17323, 17327, 17331, 17336, 17340, 17345, 17380, 17385, 17390, 
17395, 17400, 17405, 17411, 17416), class = "Date"), Segment = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2), Species = c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A"), Code = c("b1", 
"b1", "b2", "b2", "b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", 
"b2", "b1", "b4", "b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3", 
"b2", "b3", "b2", "b1", "b2", "b1", "b4", "b1", "b1", "b2", "b2", 
"b3", "b3", "b2", "b3", "b2", "b3", "b2", "b1", "b2", "b1", "b4"
)), .Names = c("Date", "Segment", "Species", "Code"), row.names = c(NA, 
-45L), class = "data.frame")

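The assumption, of course, is that the first observation of a flowering event (i.e. 'b1', 'b2', 'b3', 'b4') for a plant of a given species is correct!

Note: this question reflects my wish to recode my data set to overcome the shortcomings of the original study's coding system (see the earlier question). If I had thought before the season about how the data would be used, I would have used a coding system like: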

b1a = single flower
b2a = sparse flowers (two or three)
b3 = flowers common (more than three)
b2b = sparse flowers (two or three)
b1b = single flower
b4 = flowering ended

Regardless, I still need to overcome the coding problem in the historical data set!
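
To make the proposed scheme concrete, here is a minimal sketch of how a series that is already in sequence could be relabelled with the `a`/`b` suffixes. The helper name `recode_phase` is hypothetical, and the sketch assumes that everything before the first 'b3' is the rising phase and everything after the last 'b3' is the falling phase; any 'b1' or 'b2' lying between two 'b3' values is left unchanged.

```r
# Hypothetical helper: add "a" (rising) / "b" (falling) suffixes to b1 and b2.
# Assumes 'code' is one Segment/Species series, ordered by date.
recode_phase <- function(code) {
  peak <- which(code == "b3")
  if (length(peak) == 0) return(code)      # no peak: phases cannot be told apart
  pos     <- seq_along(code)
  early   <- code %in% c("b1", "b2")       # only these codes get a suffix
  rising  <- early & pos < min(peak)       # before the first b3
  falling <- early & pos > max(peak)       # after the last b3
  code[rising]  <- paste0(code[rising], "a")
  code[falling] <- paste0(code[falling], "b")
  code
}

recoded <- recode_phase(c("b1", "b1", "b2", "b3", "b3", "b2", "b1", "b4"))
print(recoded)
```
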

1 Answer:

Answer 0: (score: 1)

Here is one possibility, which relies on cummax:

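# assumption: d holds a single Segment/Species series from the sample data
d <- subset(Test.Data, Segment == 1 & Species == "A")

# extract numbers from 'Code', except the last, which I assume is always 4
x <- as.numeric(substring(d$Code[-length(d$Code)], 2))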

# find index of first max
ix <- which.max(x == max(x))

# find cumulative max on
# (1) x from index 1 to ix
# (2) x from end to index ix + 1
# reverse (2)
# concatenate (1), (2) and a 4
d$Code2 <- c(cummax(x[1:ix]), rev(cummax(x[length(x):(ix + 1)])), 4)

d[ , c("Code", "Code2")]
   Code Code2
1    b1     1
2    b1     1
3    b2     2
4    b2     2
5    b3     3
6    b3     3
7    b2     3
8    b3     3
9    b2     3
10   b3     3
11   b2     2
12   b1     2
13   b2     2
14   b1     1
15   b4     4
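
To see why this works, the two halves can be inspected separately. The sketch below uses the numeric codes for Segment 1, Species A typed in by hand; the names `rising` and `falling` are introduced here for illustration only. It assumes the peak is not the last element of `x`, the same assumption the code above makes.

```r
# Numeric codes for Segment 1, Species A, without the final b4
x <- c(1, 1, 2, 2, 3, 3, 2, 3, 2, 3, 2, 1, 2, 1)

# Position of the first occurrence of the maximum code
ix <- which.max(x == max(x))

# Before the peak, codes may only increase: running maximum from the start
rising <- cummax(x[1:ix])

# After the peak, codes may only decrease: running maximum taken from the end,
# then reversed back into date order
falling <- rev(cummax(rev(x[(ix + 1):length(x)])))

print(rising)
print(falling)
```

Concatenating `rising`, `falling` and a final 4 reproduces the `Code2` column shown above.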

To do this by 'Segment' and 'Species', you can use e.g. data.table:

library(data.table)
setDT(Test.Data)
Test.Data[ , Code2 := {
  x = as.numeric(substring(Code[-.N], 2))
  ix = which.max(x == max(x))
  .(paste0("b", c(cummax(x[1:ix]), rev(cummax(x[length(x):(ix + 1)])), 4)))
},
by = .(Segment, Species)]
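
The question also asks how to drop the out-of-sequence values instead of recoding them. One possible reading, offered as a sketch and not as a definitive rule, is to keep only the rows whose original code agrees with the cummax-corrected code. The helper name `corrected_codes` is hypothetical. It assumes each series has at least two observations and ends in b4, and the example data frame below reproduces only Segment 1, Species A in base R.

```r
# Hypothetical helper: cummax-corrected codes for one Segment/Species series.
# Assumes the series is ordered by date and its last code is b4.
corrected_codes <- function(code) {
  x  <- as.numeric(substring(code[-length(code)], 2))
  ix <- which.max(x == max(x))
  # guard against the peak being the last element before b4
  tail_part <- if (ix < length(x)) {
    rev(cummax(rev(x[(ix + 1):length(x)])))
  } else {
    numeric(0)
  }
  paste0("b", c(cummax(x[1:ix]), tail_part, 4))
}

d <- data.frame(
  Date = as.Date(c(17318, 17323, 17327, 17331, 17336, 17340, 17345, 17380,
                   17385, 17390, 17395, 17400, 17405, 17411, 17416),
                 origin = "1970-01-01"),
  Code = c("b1", "b1", "b2", "b2", "b3", "b3", "b2", "b3",
           "b2", "b3", "b2", "b1", "b2", "b1", "b4"),
  stringsAsFactors = FALSE
)

d$Code2 <- corrected_codes(d$Code)

# Keep only the rows whose recorded code was already in sequence
kept <- d[d$Code == d$Code2, ]
print(kept)
```

With the data.table result above, the equivalent filter would be `Test.Data[Code == Code2]`, since `Code2` there is already a "b"-prefixed string computed per group.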