I need to extract a subset of a large dataset using a list of keywords. The large dataset shown here (gene_infoNCBI) contains the keywords:
> head(gene_infoNCBI)
X.tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location
1 7 5692769 NEWENTRY - - - - -
2 9 1246500 At1g00930 pLeuDn_01 - - - -
3 9 1246501 repA2 At1g13580 - - - -
4 9 1246502 leuA pLeuDn_04 - - - -
5 9 1246503 leuB pLeuDn_05 - - - -
6 9 1246504 leuC pLeuDn_06 - - - -
description
1 Record to support submission of GeneRIFs for a gene not in Gene (Azotirhizobium caulinodans. Use when strain, subtype, isolate, etc. is unspecified, or when different from all specified ones in Gene.).
2 putative replication-associated protein
3 putative replication-associated protein
4 2-isopropylmalate synthase
5 3-isopropylmalate dehydrogenase
6 isopropylmalate isomerase large subunit
type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority
1 other - -
2 protein-coding - -
3 protein-coding - -
4 protein-coding - -
5 protein-coding - -
6 protein-coding - -
Nomenclature_status Other_designations Modification_date Feature_type
1 - - 20190202 -
2 - - 20180129 -
3 - - 20180129 -
4 - - 20180129 -
5 - - 20180129 -
6 - - 20180129 -
keyword.txt holds the keywords, which match values in the "Symbol" and "LocusTag" columns of the gene_infoNCBI file:
1 At1g00930 NA NA
2 At1g00930 NA NA
3 At1g00930 NA NA
4 At1g00930 NA NA
5 At1g00930 NA NA
6 At1g13580 NA NA
Answer 0 (score: 1)
There isn't much to go on here, but you could do something like this:
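library(tidyverse)
set.seed(10)
# Collapse the keywords into one regex alternation: "a|c|d|e|f"
keywords <- c("a", "c", "d", "e", "f")
key_vec <- str_c(keywords, collapse = "|")
# Example data: x holds random letters to match against
dat <- tibble(z = seq(1, 100, 1),
              y = runif(100, 0, 50),
              x = sample(letters, 100, replace = TRUE))
# Keep only the rows where x matches any keyword
dat %>%
  filter(str_detect(x, key_vec))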
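One caveat with the regex approach: a pattern such as "At1g00930" also matches longer strings that merely contain it. If the keywords are meant to equal whole "Symbol" or "LocusTag" values, exact matching with %in% may be a better fit. Below is a minimal sketch in base R, not a tested solution for the real file. The small `gene_info` data frame is a hypothetical stand-in built from a few rows of the question's output, and I am assuming the keywords sit in the first column of keyword.txt; here they are simply typed in as a vector.

```r
# Hypothetical stand-in for gene_infoNCBI, using a few rows from the question
gene_info <- data.frame(
  GeneID   = c(5692769, 1246500, 1246501, 1246502),
  Symbol   = c("NEWENTRY", "At1g00930", "repA2", "leuA"),
  LocusTag = c("-", "pLeuDn_01", "At1g13580", "pLeuDn_04"),
  stringsAsFactors = FALSE
)

# Assumption: these would come from the first column of keyword.txt;
# unique() drops the repeated entries seen in the question
keywords <- unique(c("At1g00930", "At1g00930", "At1g13580"))

# Keep rows whose Symbol OR LocusTag is exactly one of the keywords
hits <- gene_info[gene_info$Symbol %in% keywords |
                  gene_info$LocusTag %in% keywords, ]
print(hits)
```

With the real data you would replace `gene_info` with gene_infoNCBI. The same condition also works inside dplyr's filter() if you prefer to stay in the tidyverse.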