Question

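I have results from a DNA sequencing run that are not presented in a particularly useful way. Currently, the first column of my data frame holds species names, and the remaining columns contain all the plate wells in which DNA from that species was detected. Here is a sample of the data table: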

Species                 V1      V2      V3      V4      V5
Eupeodes corollae       1-G3    1-F1    1-E11   1-C10   1-A3
Diptera                 1-A10   1-B2    1-C1    1-G7    1-E11
Episyrphus balteatus    2-C3    2-A10   1-C11   1-A10   2-B4
Aphidie                 1-B9    1-D7    2-A3    1-C8    2-C11
Ericaphis               1-B9    1-D7    2-A3    1-C8    2-C11
Hemiptera               1-B9    1-D7    2-A3    1-C8    2-C11

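Ultimately, I would like a data frame in which the first column contains all the plate wells and the remaining columns contain all the species identified in each well, like this: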

Well  Species1              Species2                Species3
1-A1  Eupeodes corollae     Ericaphis
1-A2  Episyrphus balteatus  
1-A3  Aphidie        
1-A4  Hemiptera             Episyrphus balteatus    Aphidie
1-A5  Diptera

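I am guessing this will be a two-step procedure: first the table is reshaped into long format, with a new record for each well-species match; then, in a second stage, the records are merged so that each well appears only once in the first column and all the species found in that well are listed alongside the well name. However, I am afraid that this kind of complex reshaping is beyond my abilities in R. Can anyone advise me on how to go about this?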

Answer 1

Your thinking is spot on, and there are plenty of packages that can do this quickly.

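In the tidyverse (specifically the tidyr package), the two operations you describe are encapsulated in functions named gather and spread, respectively. There is a really cool cheatsheet produced by R Studio that covers these kinds of data-wrangling tasks.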

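The trick with your data is that spread normally expects a unique set of columns to spread into, and your long table has no such key. The good news is that you can work around this in one of two ways: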

1. Create a placeholder variable for the new unique columns, and spread using the placeholder as the key

    library(tidyr)
    library(dplyr)

    output <- 
        input %>%
        # bring all of the data into a long table
        gather(Plate, Well, V1:V5) %>%
        # remove the column with the old column names,
        # this column will cause problems in spread if not removed
        select(-Plate) %>% 
        # create the placeholder variable
        group_by(Well) %>%
        mutate(NewColumn = seq(1, n())) %>%
        # spread the data out based on the new column header
        spread(NewColumn, Species)

Depending on the purpose, and if you need to at all, you can rename the header columns either before or after the spread call.
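As a rough sketch of the "rename before spreading" option, the placeholder can be built as a label, so that the new columns come out named Species1, Species2, and so on. The `input` below is a small made-up subset used only for illustration, and the `Species` prefix is an assumption chosen to match the layout asked for in the question:

```r
library(tidyr)
library(dplyr)

# small illustrative input (an assumption; a cut-down version of the question's table)
input <- tibble(
    Species = c("Diptera", "Episyrphus balteatus", "Aphidie"),
    V1 = c("1-A10", "2-C3", "1-B9"),
    V2 = c("1-B2", "1-A10", "1-D7")
)

output <-
    input %>%
    # bring all of the data into a long table
    gather(Plate, Well, V1:V2) %>%
    select(-Plate) %>%
    group_by(Well) %>%
    # label the placeholder so the spread columns are named Species1, Species2, ...
    mutate(NewColumn = paste0("Species", row_number())) %>%
    ungroup() %>%
    spread(NewColumn, Species)
```

Wells with fewer species than the maximum get NA in the unused columns. Because the labels are character strings, a well with ten or more species may see Species10 ordered before Species2, so renaming after the spread may be preferable in that case.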

OR：

2. Alter the desired output slightly, giving you one column per species

    library(tidyr)
    library(dplyr)

    output <- 
        input %>%
        # bring all of the data into a long table
        gather(Plate, Well, V1:V5) %>%
        # remove the column with the old column names,
        # this column will cause problems in spread if not removed
        select(-Plate) %>% 
        # count the number of unique combinations of well and species
        count(Well, Species) %>%
        # spread out the counts
        # fill = 0 sets the values where no combinations exist to 0
        spread(Species, n, fill = 0)

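This gives you a different output, but I mention it because it makes it easier to see whether a well contains multiple instances of the same record (for example, the same species twice), and it sets the data up nicely for future analysis.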

Reproducible data for reference:

    input <- tibble(
        Species = c(
            "Eupeodes corollae",
            "Diptera",
            "Episyrphus balteatus",
            "Aphidie",
            "Ericaphis",
            "Hemiptera"
        ),
        V1 = c("1-G3", "1-A10", "2-C3", "1-B9", "1-B9", "1-B9"),
        V2 = c("1-F1", "1-B2", "2-A10", "1-D7", "1-D7", "1-D7"),
        V3 = c("1-E11", "1-C1", "1-C11", "2-A3", "2-A3", "2-A3"),
        V4 = c("1-C10", "1-G7", "1-A10", "1-C8", "1-C8", "1-C8"),
        V5 = c("1-A3", "1-E11", "2-B4", "2-C11", "2-C11", "2-C11")
    )
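In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A minimal sketch of approach 1 using the newer functions, assuming such a tidyr version is installed; `small_input` and `wide` are illustrative names, and the data is a made-up subset, not the full table:

```r
library(tidyr)
library(dplyr)

# small illustrative input (an assumption; two species, two well columns)
small_input <- tibble(
    Species = c("Diptera", "Episyrphus balteatus"),
    V1 = c("1-A10", "2-C3"),
    V4 = c("1-G7", "1-A10")
)

wide <-
    small_input %>%
    # long format: one row per species/well match
    pivot_longer(-Species, names_to = "Plate", values_to = "Well") %>%
    # drop the old column names, as in the gather/spread version
    select(-Plate) %>%
    # number the species within each well
    group_by(Well) %>%
    mutate(NewColumn = row_number()) %>%
    ungroup() %>%
    # one row per well, one column per species found in it
    pivot_wider(
        names_from = NewColumn,
        values_from = Species,
        names_prefix = "Species"
    )
```

Here names_prefix does the renaming in the same step, so the result has the columns Well, Species1 and Species2, with NA where a well has only one species.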
