Question

I have a data frame with 2461 observations and 80 variables, retrieved from BOLD.

library(tidyverse)
Scleractinia <- read_tsv("http://www.boldsystems.org/index.php/API_Public/combined?taxon=Scleractinia&format=tsv")

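I am currently filtering this data frame. So far I have filtered it by "markercode" and "nucleotides". I would like to filter it further by keeping only the "species_name" values that have more than 5 records.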

Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))
#This is a subset of the main dataset that includes only records with the marker code "COI-5P" and nucleotide sequences.

unique(Scleractinia.COI5P$species_name)
#There are 479 unique species present in this dataset. This is too many to work with so we are going to filter out species that don't have more than 5 records. 

SpeciesCount <- table(Scleractinia.COI5P$species_name)
#This creates a table of species and the number of records available in the dataset for this species.

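I created "SpeciesCount" to decide on the 5-record threshold, because many species have only 1 record. I don't know how to filter Scleractinia.COI5P so that all 80 variables (i.e. columns) remain available.

I tried: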

test <- Scleractinia.COI5P %>%
  filter(table(Scleractinia.COI5P$species_name) > 5)

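But this resulted in 0 observations of 80 variables. Essentially, I want to keep the 80 variables so that I can explore further what needs to be filtered out, but I want Scleractinia.COI5P to contain only species with 5 or more records.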

Answer 1

With dplyr, you only need to change the pipeline slightly: group by species name, then filter.

library(tidyverse)

##Filter first
Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))


##Group by species, then keep only groups with at least 5 rows
filtered_data_frame <- Scleractinia.COI5P %>%
  group_by(species_name) %>%
  filter(n() >= 5)

##Check that only species with at least 5 records are represented
##(count() on the grouped data frame gives one row per species)
total_species <- count(filtered_data_frame, sort = TRUE)
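
To see why this keeps every column, here is a minimal sketch on a small made-up data frame (the `toy` data, its `record_id` column and the species labels are hypothetical stand-ins for Scleractinia.COI5P, not real BOLD records). It also shows one possible way to reuse the `table()` counts from the question: `table()` returns one count per species rather than one value per row, so it cannot be passed to `filter()` directly, but the names of the species that pass the threshold can be matched against each row with `%in%`.

```r
library(dplyr)

# Hypothetical toy data standing in for Scleractinia.COI5P
toy <- data.frame(
  species_name = c(rep("species_a", 6), rep("species_b", 5), rep("species_c", 2)),
  markercode   = "COI-5P",
  record_id    = seq_len(13)
)

# Option 1: group, filter on group size, then drop the grouping
kept_grouped <- toy %>%
  group_by(species_name) %>%
  filter(n() >= 5) %>%
  ungroup()

# Option 2: reuse the table() counts and match each row with %in%
species_count <- table(toy$species_name)
keep_names    <- names(species_count)[species_count >= 5]
kept_table    <- toy %>%
  filter(species_name %in% keep_names)
```

Both results keep all three columns and drop only the rows of `species_c`. Calling `ungroup()` at the end of the first option is a design choice: it avoids later verbs silently operating per species.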

Filter a variable by number of records in a data frame with multiple variables