Question

I have a dataset that looks like this:

long_name x y short_name
Adhesion G protein-coupled receptor E2 (ADGRE2) 10 10 ADGRE2
Adhesion G-protein coupled receptor G2 (ADGRG2) 12 12 ADX2
ADM (ADM) 13 13 ADM
ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38) 14 14 ACH1

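What I would like to do is create an additional column that states whether the value of short_name occurs within the value of long_name, giving a TRUE/FALSE (or present/absent) value in the new column.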

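I have seen some suggestions here about using the grepl function to look for one string inside another. The problem I am running into is iterating over the whole file.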

I have something like this:

for (row in 1:length(nrows(combined_proteins))){

  long_name = proteins[1]
  short_name = proteins[4]

  if grepl(short_name, long_name) = TRUE{

   proteins$presence = "Present"

   else proteins$presence = "Not"
  }
}

But this obviously doesn't work, and I'm not sure it is the smartest way to solve the problem. Any help is appreciated.

Answer 1

A simple way to solve this is to use the ifelse function together with str_detect from the stringr package.

proteins<-read.table(header = TRUE, stringsAsFactors = FALSE, text=
"long_name x y short_name
'Adhesion G protein-coupled receptor E2 (ADGRE2)' 10 10 ADGRE2
'Adhesion G-protein coupled receptor G2 (ADGRG2)' 12 12 ADX2
'ADM (ADM)' 13 13 ADM
'ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)' 14 14 ACH1"
)

library(stringr)
proteins$presence <- ifelse(str_detect(proteins$long_name, proteins$short_name), "Present", "Not")
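
One caveat worth keeping in mind: str_detect treats each short_name as a regular expression, so a name containing characters such as ( or + could match in unexpected ways. The following is a minimal base-R sketch, assuming the same column names as above and that literal (non-regex) matching is what is wanted. The helper name contains_literal is hypothetical and not part of any package.

```r
# Small example data frame with the same columns as in the question.
proteins <- data.frame(
  long_name = c("Adhesion G protein-coupled receptor E2 (ADGRE2)",
                "Adhesion G-protein coupled receptor G2 (ADGRG2)",
                "ADM (ADM)",
                "ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)"),
  x = c(10, 12, 13, 14),
  y = c(10, 12, 13, 14),
  short_name = c("ADGRE2", "ADX2", "ADM", "ACH1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where each pattern occurs literally in its text.
# grepl() is not vectorised over the pattern, so mapply() pairs them row by row.
contains_literal <- function(pattern, text) {
  unname(mapply(function(p, s) grepl(p, s, fixed = TRUE), pattern, text))
}

proteins$presence <- ifelse(
  contains_literal(proteins$short_name, proteins$long_name),
  "Present", "Not"
)

print(proteins[, c("short_name", "presence")])
```

If stringr is already loaded, wrapping the pattern in fixed(), as in str_detect(long_name, fixed(short_name)), should give the same literal matching.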

Answer 2

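There are a few problems with the for loop. You want to iterate over 1:nrow() or 1:length(). length(nrow()) almost always returns 1, because nrow() returns a single number. Your if statement needs its condition in parentheses, so it should be if (boolean) {return values} else {other return value}. If the data frame is named proteins, the following should work.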

for (row in 1:nrow(proteins)){

  print(proteins$long_name[row])  # show which row is being processed
  long_name = proteins$long_name[row]
  short_name = proteins$short_name[row]

  # grepl() returns TRUE when short_name is found inside long_name
  if (grepl(short_name, long_name)){
    proteins$presence[row] = "Present"
  } else {
    proteins$presence[row] = "Not"
  }
}
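
To see why a loop header built on length(nrow()) runs only once, compare the values below. This is a small sketch on a throwaway data frame (the name df is purely illustrative). seq_len() is shown as a commonly used alternative to 1:nrow(), since it yields an empty sequence for a zero-row data frame instead of counting down from 1 to 0.

```r
# Throwaway three-row data frame for illustration.
df <- data.frame(a = c("x", "y", "z"), stringsAsFactors = FALSE)

n_rows      <- nrow(df)            # the row count: a single number
len_of_nrow <- length(nrow(df))    # length of that single number

idx_wrong <- 1:length(nrow(df))    # sequence the broken loop header would produce
idx_right <- seq_len(nrow(df))     # one index per row

# Edge case: a data frame with zero rows.
empty <- df[0, , drop = FALSE]
idx_colon_empty <- 1:nrow(empty)         # 1:0 counts down, giving two indices
idx_seq_empty   <- seq_len(nrow(empty))  # empty sequence, so a loop body never runs

print(idx_wrong)
print(idx_right)
```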

You can also do the same thing with the tidyverse packages dplyr and purrr. purrr provides functions for iterating over several columns at once.

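library(dplyr)  # provides %>% and mutate()

# map2_lgl() walks short_name and long_name in parallel and returns a logical vector
proteins <- proteins %>%
  dplyr::mutate(short_in_long = purrr::map2_lgl(short_name, long_name, function(x, y){
    grepl(x, y)
  }))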

Searching for part of a string within another string in a data frame
