Question

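I want to search my dataset for rows that have certain attribute values across multiple columns.

To do this, I found that I can use grep like this: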

df <- read.csv('example.csv', header = TRUE, sep='\t')

df[grep("region+druggable", df$locus_type=="region", df$drug_binary==1),]

But when I run it, all I get as output is the column names. Why is that?

My data frame looks like this:
id    locus_type     drug_binary
1     pseudogene     1    
2     unknown        0
3     region         1
4     region         0
5     phenotype_only 1
6     region         1
...

Ideally, I would like my query to return rows 3 and 6.

Answer 1

If you want to use base R, the correct syntax is:

df[grepl("region|druggable",df$locus_type) & df$drug_binary==1,]

It gives the following output:

  id locus_type drug_binary
3  3     region           1
6  6     region           1

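Because you want to combine logical vectors, you need grepl, which returns a logical vector. I am also assuming you want to check whether the locus type is region or druggable; the regular expression I used above is the right way to express that in grepl. Note that grepl matches substrings, so a value that merely contains either word would also match.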
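To make the difference concrete, here is a minimal sketch using a hand-built copy of the six sample rows from the question. The data frame construction and the names idx, is_region and result are my own, for illustration only. The signature of grep is grep(pattern, x, ignore.case, ...), so in the original call the comparison df$locus_type=="region" is treated as the text to search and the second comparison lands in ignore.case, which is probably why no rows come back.

```r
# Hand-built copy of the sample rows from the question (an assumption,
# not the asker's actual file)
df <- data.frame(
  id = 1:6,
  locus_type = c("pseudogene", "unknown", "region",
                 "region", "phenotype_only", "region"),
  drug_binary = c(1, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# grep() returns the integer positions of the matching elements
idx <- grep("region", df$locus_type)

# grepl() returns one TRUE/FALSE per element, so it can be combined with &
is_region <- grepl("region", df$locus_type)

# Combine both conditions into one logical vector and subset the rows
result <- df[is_region & df$drug_binary == 1, ]
print(result)
```

If you need exact equality rather than a substring match, df$locus_type == "region" or df$locus_type %in% c("region", "druggable") avoids regular expressions altogether.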

Answer 2

I like dplyr for its readability:

library(dplyr)

subdf <- filter(df, locus_type=="region", drug_binary==1)
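For anyone who wants to try this without the original file, here is a self-contained sketch. It assumes dplyr is installed and uses a hand-built copy of the sample rows from the question; the name subdf_and is hypothetical. Conditions separated by commas in filter() are combined with a logical AND, so the second call below should select the same rows as the first.

```r
library(dplyr)

# Hand-built copy of the sample rows from the question (an assumption,
# not the asker's actual file)
df <- data.frame(
  id = 1:6,
  locus_type = c("pseudogene", "unknown", "region",
                 "region", "phenotype_only", "region"),
  drug_binary = c(1, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Comma-separated conditions: a row must satisfy all of them
subdf <- filter(df, locus_type == "region", drug_binary == 1)

# The same filter written with an explicit &
subdf_and <- filter(df, locus_type == "region" & drug_binary == 1)

print(subdf)
```

Unlike the grepl version, this compares locus_type for exact equality, so values that only contain the word "region" are not selected.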

Answer 3

Sometimes the sqldf library can be helpful.

?sqldf

SQL select on data frames

Description

SQL select on data frames

This is how you get the result you need:

# load the sqldf library
# if you get error "Error in library(sqldf) : there is no package called sqldf"
# you can install it simply by typing
# install.packages('sqldf')   <-- please notice the quotes!
library(sqldf)

# load your input data frame; it is named mydf because the SQL statement
# below refers to it by that name
mydf <- read.csv('/tmp/data.csv', stringsAsFactors = FALSE)

# of course it's a data.frame
class(mydf)

# express your query in SQL terms
sql_statement <- "select * from mydf where locus_type='region' and drug_binary=1"

# create a new data.frame as output of a select statement
# please notice how the "mydf" data.frame automagically becomes a valid sqlite table
output.dataframe <- sqldf(sql_statement)

# the output of a sqldf 'select' statement is a data.frame, too
class(output.dataframe)

# print your output df
output.dataframe

id locus_type drug_binary
 3     region           1
 6     region           1
