I have a data frame with 2461 observations and 80 variables, retrieved from BOLD.
Scleractinia <- read_tsv("http://www.boldsystems.org/index.php/API_Public/combined?taxon=Scleractinia&format=tsv")
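I am currently filtering this data frame. So far I have filtered it by `markercode` and `nucleotides`. I would like to filter it further by keeping only the `species_name` values that have more than 5 records.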
Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))
#This is a subset of the main dataset that includes only records with the marker code "COI-5P" and nucleotide sequences.
unique(Scleractinia.COI5P$species_name)
#There are 479 unique species present in this dataset. This is too many to work with so we are going to filter out species that don't have more than 5 records.
SpeciesCount <- table(Scleractinia.COI5P$species_name)
#This creates a table of species and the number of records available in the dataset for this species.
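I created `SpeciesCount` to work out the 5-record threshold, because many species have only 1 record. What I don't know is how to filter Scleractinia.COI5P so that all 80 variables (i.e. columns) remain available.
I have tried: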
test <- Scleractinia.COI5P %>%
filter(table(Scleractinia.COI5P$species_name) > 5)
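But this resulted in 0 observations of 80 variables. Essentially, I want to keep all 80 variables so that I can explore further what needs to be filtered out, but I want Scleractinia.COI5P to contain only species with 5 or more records.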
Answer 0 (score: 0)
With dplyr, you only need to change the pipeline slightly: group by species name, then filter.
library(tidyverse)
## Filter first
Scleractinia.COI5P <- Scleractinia %>%
  filter(markercode == "COI-5P") %>%
  filter(str_detect(nucleotides, "[ACGT]"))
## Group by species, then keep only groups with at least 5 records
filtered_data_frame <- Scleractinia.COI5P %>%
  group_by(species_name) %>%
  filter(n() >= 5)
## Check that only species with at least 5 records remain
total_species <- count(filtered_data_frame, sort = TRUE)
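To see the behaviour without downloading the BOLD data, here is a minimal sketch on a made-up data frame. The object `toy`, its three columns, and the species labels are hypothetical stand-ins for Scleractinia.COI5P and its 80 columns. It shows the grouped filter next to an alternative that reuses a `table()` of counts like the `SpeciesCount` object in the question. The threshold is written as `>= 5`; use `> 5` if "more than 5 records" is what you actually need.

```r
library(dplyr)

# Hypothetical toy data standing in for Scleractinia.COI5P (3 columns, not 80)
toy <- data.frame(
  species_name = c(rep("Species A", 6), rep("Species B", 5), rep("Species C", 2)),
  markercode   = "COI-5P",
  record_id    = seq_len(13),
  stringsAsFactors = FALSE
)

# Grouped filter: n() is the number of rows in the current species group
kept_grouped <- toy %>%
  group_by(species_name) %>%
  filter(n() >= 5) %>%
  ungroup()

# Alternative: build the count table first, then keep rows whose species is in it
species_count <- table(toy$species_name)
keep_names    <- names(species_count[species_count >= 5])
kept_table    <- toy %>%
  filter(species_name %in% keep_names)

# Both results keep every column; inspect the per-species counts that remain
count(kept_grouped, species_name, sort = TRUE)
```

On data with missing species names the two approaches can differ: `group_by()` treats `NA` as a group of its own, so those rows survive if there are enough of them, whereas `table()` drops `NA` by default and the `%in%` filter removes those rows.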