Question

I need to extract a subset of a large dataset using a list of keywords. The large dataset, gene_infoNCBI (shown here), contains the keywords:

> head(gene_infoNCBI)
  X.tax_id  GeneID   Symbol  LocusTag Synonyms dbXrefs chromosome map_location
1        7 5692769 NEWENTRY         -        -       -          -            -
2        9 1246500    At1g00930 pLeuDn_01        -       -          -            -
3        9 1246501    repA2 At1g13580        -       -          -            -
4        9 1246502     leuA pLeuDn_04        -       -          -            -
5        9 1246503     leuB pLeuDn_05        -       -          -            -
6        9 1246504     leuC pLeuDn_06        -       -          -            -
                                                                                                                                                                                                 description
1 Record to support submission of GeneRIFs for a gene not in Gene (Azotirhizobium caulinodans.  Use when strain, subtype, isolate, etc. is unspecified, or when different from all specified ones in Gene.).
2                                                                                                                                                                    putative replication-associated protein
3                                                                                                                                                                    putative replication-associated protein
4                                                                                                                                                                                 2-isopropylmalate synthase
5                                                                                                                                                                            3-isopropylmalate dehydrogenase
6                                                                                                                                                                    isopropylmalate isomerase large subunit
    type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority
1          other                                  -                                     -
2 protein-coding                                  -                                     -
3 protein-coding                                  -                                     -
4 protein-coding                                  -                                     -
5 protein-coding                                  -                                     -
6 protein-coding                                  -                                     -
  Nomenclature_status Other_designations Modification_date Feature_type
1                   -                  -          20190202            -
2                   -                  -          20180129            -
3                   -                  -          20180129            -
4                   -                  -          20180129            -
5                   -                  -          20180129            -
6                   -                  -          20180129            -

keyword.txt contains keywords that match values in the "Symbol" and "LocusTag" columns of gene_infoNCBI.

1              At1g00930          NA NA
2              At1g00930          NA NA
3              At1g00930          NA NA
4              At1g00930          NA NA
5              At1g00930          NA NA
6              At1g13580          NA NA

Answer 1

There isn't much to go on here, but you could do something like this:

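library(tidyverse)

set.seed(10)  # make the random example data reproducible

# Keywords to search for, collapsed into one regex alternation: "a|c|d|e|f"
keywords <- c("a", "c", "d", "e", "f")
key_vec <- str_c(keywords, collapse = "|")

# Example data: x holds the values that are matched against the keywords
dat <- tibble(z = seq(1, 100, 1), 
              y = runif(100, 0, 50), 
              x = sample(letters, 100, replace = TRUE))

# Keep only the rows where x matches at least one keyword
dat %>% 
  filter(str_detect(x, key_vec))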
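
Note that str_detect() with a collapsed pattern performs regular-expression substring matching, so a short keyword would also match longer symbols that contain it. For gene identifiers an exact match is probably closer to what the question asks for. The following is a minimal base R sketch of that idea, not a definitive solution. It assumes the keywords have already been read into a character vector; the small `gene_info` data frame and the `keywords` vector are illustrative stand-ins built from values shown in the question, not the real files.

```r
# Toy stand-in for gene_infoNCBI (only the columns used for matching).
gene_info <- data.frame(
  GeneID   = c(5692769, 1246500, 1246501, 1246502),
  Symbol   = c("NEWENTRY", "At1g00930", "repA2", "leuA"),
  LocusTag = c("-", "pLeuDn_01", "At1g13580", "pLeuDn_04"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword vector; in practice it would come from keyword.txt.
# unique() drops the repeated entries seen in the keyword listing.
keywords <- unique(c("At1g00930", "At1g00930", "At1g13580"))

# Keep rows whose Symbol OR LocusTag is exactly equal to one of the keywords.
hits <- gene_info$Symbol %in% keywords | gene_info$LocusTag %in% keywords
subset_rows <- gene_info[hits, ]

print(subset_rows)
```

`%in%` compares whole strings, so it avoids accidental partial matches and does not require escaping regex metacharacters in the keywords.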
