Question

I have a dataset with the following structure:

     site    block treatment date insect1 insect2 insect3 insect4 ...
1  location1     a  chemical1 date1  0     0      10       1          
2  location1     a  chemical2 date1  1     0       2       0   
3  location1     a  chemical3 date1  0     0      23       1   
4  location1     a  chemical4 date1  0     0       5       0   
5  location1     a  chemical5 date1  0     0       9       0   
6  location1     b  chemical1 date1  0     1       5       0   
7  location1     b  chemical2 date1  1     0       5       1   
8  location1     b  chemical3 date1  0     0       4       0   
9  location1     b  chemical4 date1  0     0       5       0   
10 location1     b  chemical5 date1  3     0      12       0   
11 location1     c  chemical1 date1  0     0       2       1   
12 location1     c  chemical2 date1  0     0       0       0   
13 location1     c  chemical3 date1  0     0       4       0   
14 location1     c  chemical4 date1  0     0       2       7   
15 location1     c  chemical5 date1  2     0       5       0   
16 location1     d  chemical1 date1  0     0       8       1   
17 location1     d  chemical2 date1  0     0       3       0   
18 location1     d  chemical3 date1  0     0      10       0   
19 location1     d  chemical4 date1  0     0       2       0   
20 location1     d  chemical5 date1  0     1       7       0
       .         .     .        .    .     .       .       .   
       .         .     .        .    .     .       .       .   
       .         .     .        .    .     .       .       .

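This dataset is the result of an experiment in which I tested the effect of five different chemical treatments (chemical1-5) on many different insect species (here insect1-4) at a field site (location1). The experiment was blocked four times (a-d) at different positions within the field site and repeated on five different dates (only date1 is shown). All of this information is stored in the first four columns of the dataset.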

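The next set of columns (I have 46, but only show 4) represents the different insect species and holds the number of insects I caught with a specific chemical in each treatment x block x date combination (i.e., each row).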

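As part of my analysis, I want to go through this dataset and, for each insect species, find the block x date combinations in which no insects were caught. For example, I caught no individuals of insect2 in blocks a or c on date1, so I want those rows removed from the final dataset used for analysis.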
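A minimal sketch of that goal in base R, using toy data with made-up counts. The names `sticky_demo`, `insect_cols`, `totals`, and `zero_combos` are illustrative assumptions and do not come from the real script; the sketch only lists the date x block x insect combinations whose total catch is zero.

```r
# Toy data mirroring the structure above (counts are made up)
sticky_demo <- data.frame(
  site      = "location1",
  block     = rep(c("a", "b"), each = 3),
  treatment = rep(paste0("chemical", 1:3), times = 2),
  date      = "date1",
  insect1   = c(0, 1, 0, 0, 0, 0),
  insect2   = c(0, 0, 0, 0, 2, 0)
)

insect_cols <- c("insect1", "insect2")

# Total catch per date x block for every insect column
totals <- aggregate(sticky_demo[insect_cols],
                    by = sticky_demo[c("date", "block")],
                    FUN = sum)

# Stack to long form: one row per date x block x insect total
zero_combos <- data.frame(
  date   = rep(totals$date, times = length(insect_cols)),
  block  = rep(totals$block, times = length(insect_cols)),
  insect = rep(insect_cols, each = nrow(totals)),
  total  = unlist(totals[insect_cols], use.names = FALSE)
)

# Keep only the combinations in which nothing was caught
zero_combos <- zero_combos[zero_combos$total == 0, ]
print(zero_combos)
```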

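I have spent a lot of time on code to accomplish this, but last night I realized that my code does not work the way I thought it did, and I am at a loss as to how to fix it. Here is the code so far (I have included every step so that people can see where the problem may have been introduced, or suggest a better way to approach it):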

Create a list in which each insect species (here columns 5-8) has its own data frame

sticky.list = lapply(sticky[-c(1:4,50)], function(i)data.frame(site=sticky$site, 
                                                          block=sticky$block,
                                                          treatment=sticky$treatment,
                                                          date=sticky$date,
                                                          number=as.numeric(i)))

An example of one of the data frames created as part of my list

$insect1
       site    block     treatment date     number
1  location1     a       chemical1 date1      0
2  location1     a       chemical2 date1      1
3  location1     a       chemical3 date1      0
4  location1     a       chemical4 date1      0
5  location1     a       chemical5 date1      0

Then add a new column holding the data frame's name (i.e., the insect name) to each data frame in the list

temp.list = Map(cbind, sticky.list, morphotype = names(sticky.list))  

       site    block   treatment date     number morphotype
1  location1     a     chemical1 date1      0      insect1
2  location1     a     chemical2 date1      1      insect1      
3  location1     a     chemical3 date1      0      insect1
4  location1     a     chemical4 date1      0      insect1
5  location1     a     chemical5 date1      0      insect1

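Make a larger dataset by combining vertically, then flatten each list element (i.e., make one big data frame). This puts all of the data frames from my earlier list into a single data frame.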

sticky.list.combined.df <- temp.list %>% bind_rows(temp.list) %>% # make larger sample data
  mutate_if(is.list, simplify_all) %>% # flatten each list element internally 
  unnest()
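For comparison, a minimal base R sketch of the same wide-to-long step on toy data with made-up counts. The names `sticky_demo`, `id_cols`, `insect_cols`, and `long` are illustrative assumptions. One thing worth checking in the pipeline above: `temp.list %>% bind_rows(temp.list)` passes the list to `bind_rows()` twice, which would stack every data frame twice and so double every count; the sketch below stacks each insect exactly once.

```r
# Toy wide data: two insect columns, two blocks, two treatments
sticky_demo <- data.frame(
  site      = "location1",
  block     = c("a", "a", "b", "b"),
  treatment = rep(c("chemical1", "chemical2"), times = 2),
  date      = "date1",
  insect1   = c(0, 1, 0, 3),
  insect2   = c(0, 0, 1, 0)
)

id_cols     <- c("site", "block", "treatment", "date")
insect_cols <- setdiff(names(sticky_demo), id_cols)

# Build one data frame per insect column, then stack them once
long <- do.call(rbind, lapply(insect_cols, function(nm) {
  cbind(sticky_demo[id_cols],
        number     = as.numeric(sticky_demo[[nm]]),
        morphotype = nm)
}))
rownames(long) <- NULL
print(long)
```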

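Group by date, block, and morphotype, and find the sum of number within each group. Then use a join to add this sum column to the main large data frame we just created, i.e. sticky.list.combined.df.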

sticky.list.combined.df.sum<- sticky.list.combined.df %>%
  group_by(date, block, morphotype) %>%
  summarize(sum = sum(number))

# A tibble: 855 x 4
# Groups:   date, block [?]
   date            block morphotype    sum
   <fct>           <fct> <chr>       <dbl>
 1 date1 a     insect1     0
 2 date1 a     insect2     0
 3 date1 a     insect3     0
 4 date1 a     insect4     0
# … with 845 more rows

Then

sticky.list.analysis<-left_join(sticky.list.combined.df,sticky.list.combined.df.sum, by=c("date"="date",
                                                                                          "morphotype"="morphotype"))

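Here is a sample of the output showing only insect1. The deciding factor in whether the 5 rows for each block.x are kept is the last two columns, block.y and sum, which give the total of all insects caught across chemicals 1-5 for each block (a-d).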

      site       block.x    treatment date    number     morphotype block.y sum
1   location1       a       chemical1 date1      0         insect1       a   2
2   location1       a       chemical1 date1      0         insect1       b   8
3   location1       a       chemical1 date1      0         insect1       c   4
4   location1       a       chemical1 date1      0         insect1       d   0
5   location1       a       chemical2 date1      0         insect1       a   2
6   location1       a       chemical2 date1      0         insect1       b   8
7   location1       a       chemical2 date1      0         insect1       c   4
8   location1       a       chemical2 date1      0         insect1       d   0
9   location1       a       chemical3 date1      0         insect1       a   2
10  location1       a       chemical3 date1      0         insect1       b   8
11  location1       a       chemical3 date1      0         insect1       c   4
12  location1       a       chemical3 date1      0         insect1       d   0
13  location1       a       chemical4 date1      0         insect1       a   2
14  location1       a       chemical4 date1      0         insect1       b   8
15  location1       a       chemical4 date1      0         insect1       c   4
16  location1       a       chemical4 date1      0         insect1       d   0
17  location1       a       chemical5 date1      0         insect1       a   2
18  location1       a       chemical5 date1      0         insect1       b   8
19  location1       a       chemical5 date1      0         insect1       c   4
20  location1       a       chemical5 date1      0         insect1       d   0

I think this is where I run into the problem

Filter for rows where sum > 0.

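For each combination of trapping date (e.g., date1) and morphotype, remove the rows for any block (i.e., blocks a-d) in which zero individuals of that morphotype were caught. In trapping experiments (and as is common statistical practice in the Hanks lab), dates on which no target insects were caught are usually dropped or not included. Such zeros may be related to abiotic factors (e.g., too cold or too hot, rain) or to phenological factors of the insect. Keeping these zeros in the data reduces our chance of detecting significant effects, so we exclude them.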

sticky.list.analysis.reduced<- sticky.list.analysis %>% 
  filter(sum > 0)
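A minimal sketch of the rule described above, written without a join, on toy data with made-up counts. The names `long`, `group_total`, and `kept` are illustrative assumptions. Base R's `ave()` returns the group sum aligned with every original row, so the filter is applied directly to the long data.

```r
# Toy long-format data: block a has catches of insect1, block d has none
long <- data.frame(
  site       = "location1",
  block      = rep(c("a", "d"), each = 2),
  treatment  = rep(c("chemical1", "chemical2"), times = 2),
  date       = "date1",
  number     = c(0, 2, 0, 0),
  morphotype = "insect1"
)

# Total catch of each date x block x morphotype group, repeated on every row
group_total <- ave(long$number, long$date, long$block, long$morphotype,
                   FUN = sum)

# Drop every row belonging to a group in which nothing was caught
kept <- long[group_total > 0, ]
print(kept)
```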

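The shortened output below shows that for insect1 we should keep blocks a-c. Which blocks are kept will vary depending on the insect being considered. What I want to do now is take this information from block.y and use it to remove the rows for the corresponding blocks.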

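Unfortunately, this is not the output I want. R dropped rows based on the sum column, so block d was removed according to the block.y column. What we actually need is for rows 46-60 to be removed.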

Output:

       site block.x treatment date number morphotype block.y sum
1    location1   a    chemical1 date1   0      insect1    a   2
2    location1   a    chemical1 date1   0      insect1    b   8
3    location1   a    chemical1 date1   0      insect1    c   4
4    location1   a    chemical2 date1   0      insect1    a   2
5    location1   a    chemical2 date1   0      insect1    b   8
6    location1   a    chemical2 date1   0      insect1    c   4
7    location1   a    chemical3 date1   0      insect1    a   2
8    location1   a    chemical3 date1   0      insect1    b   8
9    location1   a    chemical3 date1   0      insect1    c   4
10   location1   a    chemical4 date1   0      insect1    a   2
11   location1   a    chemical4 date1   0      insect1    b   8
12   location1   a    chemical4 date1   0      insect1    c   4
13   location1   a    chemical5 date1   0      insect1    a   2
14   location1   a    chemical5 date1   0      insect1    b   8
15   location1   a    chemical5 date1   0      insect1    c   4
16   location1   b    chemical1 date1   0      insect1    a   2
17   location1   b    chemical1 date1   0      insect1    b   8
18   location1   b    chemical1 date1   0      insect1    c   4
19   location1   b    chemical2 date1   0      insect1    a   2
20   location1   b    chemical2 date1   0      insect1    b   8
21   location1   b    chemical2 date1   0      insect1    c   4
22   location1   b    chemical3 date1   0      insect1    a   2
23   location1   b    chemical3 date1   0      insect1    b   8
24   location1   b    chemical3 date1   0      insect1    c   4
25   location1   b    chemical4 date1   0      insect1    a   2
26   location1   b    chemical4 date1   0      insect1    b   8
27   location1   b    chemical4 date1   0      insect1    c   4
28   location1   b    chemical5 date1   0      insect1    a   2
29   location1   b    chemical5 date1   0      insect1    b   8
30   location1   b    chemical5 date1   0      insect1    c   4
31   location1   c    chemical1 date1   0      insect1    a   2
32   location1   c    chemical1 date1   0      insect1    b   8
33   location1   c    chemical1 date1   0      insect1    c   4
34   location1   c    chemical2 date1   0      insect1    a   2
35   location1   c    chemical2 date1   0      insect1    b   8
36   location1   c    chemical2 date1   0      insect1    c   4
37   location1   c    chemical3 date1   0      insect1    a   2
38   location1   c    chemical3 date1   0      insect1    b   8
39   location1   c    chemical3 date1   0      insect1    c   4
40   location1   c    chemical4 date1   0      insect1    a   2
41   location1   c    chemical4 date1   0      insect1    b   8
42   location1   c    chemical4 date1   0      insect1    c   4
43   location1   c    chemical5 date1   0      insect1    a   2
44   location1   c    chemical5 date1   0      insect1    b   8
45   location1   c    chemical5 date1   0      insect1    c   4
46   location1   d    chemical1 date1   0      insect1    a   2
47   location1   d    chemical1 date1   0      insect1    b   8
48   location1   d    chemical1 date1   0      insect1    c   4
49   location1   d    chemical2 date1   0      insect1    a   2
50   location1   d    chemical2 date1   0      insect1    b   8
51   location1   d    chemical2 date1   0      insect1    c   4
52   location1   d    chemical3 date1   0      insect1    a   2
53   location1   d    chemical3 date1   0      insect1    b   8
54   location1   d    chemical3 date1   0      insect1    c   4
55   location1   d    chemical4 date1   0      insect1    a   2
56   location1   d    chemical4 date1   0      insect1    b   8
57   location1   d    chemical4 date1   0      insect1    c   4
58   location1   d    chemical5 date1   0      insect1    a   2
59   location1   d    chemical5 date1   0      insect1    b   8
60   location1   d    chemical5 date1   0      insect1    c   4

Desired output:

       site block.x treatment date number morphotype block.y sum
1    location1   a    chemical1 date1   0      insect1    a   2
2    location1   a    chemical1 date1   0      insect1    b   8
3    location1   a    chemical1 date1   0      insect1    c   4
4    location1   a    chemical2 date1   0      insect1    a   2
5    location1   a    chemical2 date1   0      insect1    b   8
6    location1   a    chemical2 date1   0      insect1    c   4
7    location1   a    chemical3 date1   0      insect1    a   2
8    location1   a    chemical3 date1   0      insect1    b   8
9    location1   a    chemical3 date1   0      insect1    c   4
10   location1   a    chemical4 date1   0      insect1    a   2
11   location1   a    chemical4 date1   0      insect1    b   8
12   location1   a    chemical4 date1   0      insect1    c   4
13   location1   a    chemical5 date1   0      insect1    a   2
14   location1   a    chemical5 date1   0      insect1    b   8
15   location1   a    chemical5 date1   0      insect1    c   4
16   location1   b    chemical1 date1   0      insect1    a   2
17   location1   b    chemical1 date1   0      insect1    b   8
18   location1   b    chemical1 date1   0      insect1    c   4
19   location1   b    chemical2 date1   0      insect1    a   2
20   location1   b    chemical2 date1   0      insect1    b   8
21   location1   b    chemical2 date1   0      insect1    c   4
22   location1   b    chemical3 date1   0      insect1    a   2
23   location1   b    chemical3 date1   0      insect1    b   8
24   location1   b    chemical3 date1   0      insect1    c   4
25   location1   b    chemical4 date1   0      insect1    a   2
26   location1   b    chemical4 date1   0      insect1    b   8
27   location1   b    chemical4 date1   0      insect1    c   4
28   location1   b    chemical5 date1   0      insect1    a   2
29   location1   b    chemical5 date1   0      insect1    b   8
30   location1   b    chemical5 date1   0      insect1    c   4
31   location1   c    chemical1 date1   0      insect1    a   2
32   location1   c    chemical1 date1   0      insect1    b   8
33   location1   c    chemical1 date1   0      insect1    c   4
34   location1   c    chemical2 date1   0      insect1    a   2
35   location1   c    chemical2 date1   0      insect1    b   8
36   location1   c    chemical2 date1   0      insect1    c   4
37   location1   c    chemical3 date1   0      insect1    a   2
38   location1   c    chemical3 date1   0      insect1    b   8
39   location1   c    chemical3 date1   0      insect1    c   4
40   location1   c    chemical4 date1   0      insect1    a   2
41   location1   c    chemical4 date1   0      insect1    b   8
42   location1   c    chemical4 date1   0      insect1    c   4
43   location1   c    chemical5 date1   0      insect1    a   2
44   location1   c    chemical5 date1   0      insect1    b   8
45   location1   c    chemical5 date1   0      insect1    c   4
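A minimal sketch of one likely cause and fix, on toy data with made-up counts. The names `long`, `sums`, `too_wide`, `joined`, and `reduced` are illustrative assumptions. The block.x and block.y columns suggest that the join key leaves out block, so every row is paired with the sum of every block. Adding block to the key gives one sum per original row (and a single block column), after which filtering on sum > 0 drops whole blocks. The sketch uses base R `merge()`; the dplyr equivalent is noted in a comment.

```r
# Toy long-format data: block a has catches of insect1, block d has none
long <- data.frame(
  site       = "location1",
  block      = rep(c("a", "d"), each = 2),
  treatment  = rep(c("chemical1", "chemical2"), times = 2),
  date       = "date1",
  number     = c(0, 2, 0, 0),
  morphotype = "insect1"
)

# Sum per date x block x morphotype, as in the summarize() step
sums <- aggregate(number ~ date + block + morphotype, data = long, FUN = sum)
names(sums)[names(sums) == "number"] <- "sum"

# Joining on date and morphotype only pairs each row with every block's sum
too_wide <- merge(long, sums, by = c("date", "morphotype"))

# Including block in the key keeps one row per original observation
# (dplyr: left_join(long, sums, by = c("date", "block", "morphotype")))
joined  <- merge(long, sums, by = c("date", "block", "morphotype"))
reduced <- joined[joined$sum > 0, ]
print(reduced)
```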

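Once this is solved, I want to subset each insect out from the insect column (I know how to do this manually, but not for all insect species; that is an entirely different question) and then run a generalized linear mixed model to assess the effect of treatment on the catch of each insect, with date and location as random effects.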
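A minimal sketch of the per-insect subsetting step, on toy data with made-up counts; `long` and `by_insect` are illustrative names. Base R's `split()` returns a named list with one data frame per morphotype, which could then be passed to a model-fitting function with `lapply()`.

```r
# Toy long-format data with two morphotypes
long <- data.frame(
  block      = rep(c("a", "b"), times = 2),
  treatment  = "chemical1",
  date       = "date1",
  number     = c(0, 3, 1, 0),
  morphotype = rep(c("insect1", "insect2"), each = 2)
)

# One data frame per insect, named after the morphotype
by_insect <- split(long, long$morphotype)
print(names(by_insect))
```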

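Thank you for any insight on this. Please let me know if I should edit the question to add more information; I have tried my best to make the data structure and the problem clear. Thanks.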

Answer 1

Have you tried the subset function? It is part of base R.

You could do the following:

filtered.sticky.list.analysis <- subset(sticky.list.analysis, block.x == "a" | block.x == "b" | block.x == "c")

Another approach that would work:

filtered.sticky.list.analysis <- subset(sticky.list.analysis, block.x != "d")

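The code is fairly self-explanatory. The first option selects every row where block.x equals a, b, or c. The second option selects every row where block.x is different from d.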
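A small sketch on made-up data of the same row selection written with `%in%`; `analysis`, `keep_in`, and `keep_or` are illustrative names. When conditions are combined inside `subset()`, the vectorized `|` operator is needed, because the scalar `||` does not compare element by element.

```r
# Toy data: five rows across four blocks
analysis <- data.frame(
  block.x = c("a", "b", "c", "d", "d"),
  number  = c(1, 0, 2, 0, 3)
)

# Membership test: keep rows whose block is one of a, b, c
keep_in <- subset(analysis, block.x %in% c("a", "b", "c"))

# Same selection with the vectorized OR operator
keep_or <- subset(analysis, block.x == "a" | block.x == "b" | block.x == "c")

print(identical(keep_in, keep_or))
```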
