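Vectorize a for loop to count phrase occurrences in R

Time: 2017-07-14 23:09:43

Tags: r string for-loop vectorization grepl

I have a variable in one data set (female$Var2) that contains phrases I want to search for. I want to find the number of rows in which each phrase appears in a column of another data frame (female_df$MH2). For example, female$Var2 looks like: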

myocardial infarction drug therapy
imipramine poisoning
oximetry
thrombosis drug therapy
angioedema chemically induced

I want to find the number of rows in female_df$MH2 that contain each of the phrases above. female_df$MH2 looks like this:

oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

So my resulting output should look like this:

myocardial infarction drug therapy          1
imipramine poisoning                        1
oximetry                                    2
thrombosis drug therapy                     2
angioedema chemically induced               1

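Note that this is not the total number of occurrences (see "angioedema chemically induced"); it is the number of rows that contain the phrase. I am currently running a for loop because I am searching for 5,000+ terms across 428,000+ rows. When I try to vectorize by calling occurrences_female(female$Var2), I get the warning "In grepl(word, female_df$MH2, ignore.case = TRUE) : argument 'pattern' has length > 1 and only the first element will be used", and only the result for the first element of female$Var2 is returned.

Here is the for loop I am running: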

for (i in 1:nrow(female))
{
  word <- female$Var2[i]
  df_female <- data.frame(word, occurrences_female(word))
  df_female2 <- rbind(df_female2, df_female)
}

It is based on this function:

occurrences_female <- function(word)
{
  # inserts \\b in the beginning
  word <- paste0("\\b", word)

  # inserts \\b at the end
  n <- nchar(word)
  word <- paste(substr(word, 1, n), "\\b", sep = "")

  occurrences <- sum(grepl(word, female_df$MH2, ignore.case = TRUE))

  return (occurrences)
}

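The function works when I call it manually on a single term, but I need to run it for 5,000+ terms and the for loop is too slow (it has been running for over 2 hours). I can't figure out how to search one data frame's variable for the values of a variable in a different data frame.

3 Answers:

Answer 0: (score: 4)

Summary

We can use the following code to accomplish the task. Benchmarking shows that it performs well.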

library(purrr)
library(stringr)

female$Count <- map_int(female$Var2, 
                    function(x){sum(str_detect(female_df$MH2, pattern = x))})
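
The one-liner above treats each phrase as a plain regular expression, whereas the function in the question also wraps the phrase in \\b word boundaries and ignores case. If those two behaviours matter, a minimal base R sketch that keeps them could look like the following. The helper name count_rows is hypothetical, and the data is the example from the question.

```r
# Example data copied from the question
female <- data.frame(
  Var2 = c("myocardial infarction drug therapy", "imipramine poisoning",
           "oximetry", "thrombosis drug therapy",
           "angioedema chemically induced"),
  stringsAsFactors = FALSE
)
female_df <- data.frame(
  MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
          "angioedema chemically induced, angioedema chemically induced, oximetry",
          "abo blood group system, imipramine poisoning, adverse effects",
          "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
          "thrombosis drug therapy"),
  stringsAsFactors = FALSE
)

# count_rows() is a hypothetical helper name, not part of any package
count_rows <- function(terms, text) {
  vapply(terms, function(term) {
    # Whole-word match, built the same way as in the question's function
    pattern <- paste0("\\b", term, "\\b")
    # grepl() gives one TRUE/FALSE per row, so the sum is a row count
    sum(grepl(pattern, text, ignore.case = TRUE))
  }, integer(1), USE.NAMES = FALSE)
}

female$Count <- count_rows(female$Var2, female_df$MH2)
```

Here vapply calls grepl once per term, which avoids the "pattern has length > 1" warning from the question, because grepl accepts only a single pattern at a time.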

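Introduction

There are multiple ways to count the number of rows that contain each word or phrase. Based on the answers and discussion in this thread so far, the general strategy is as follows.

  1. Use a function to vectorize the operation, such as lapply or sapply from base R, or the map functions from the purrr package.
  2. Use a function to count or detect whether a given pattern (a word or phrase) is in a string, such as grep or grepl from base R, or str_which or str_detect from the stringr package.
  3. Since the OP needs to process a large amount of data, I ran an analysis to compare which combination of functions from base R, purrr, and stringr completes the same task in the least time.

    I investigated a total of eight combinations: the loop can use either sapply or map_int, and the matching can use grep, grepl, str_which, or str_detect.

    Data Preparation

    Here I create two data frames, female and female_df, based on the OP's example. Note that I set stringsAsFactors = FALSE to make sure each column is stored as character.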

    # Create the example data frame: female
    female <- data.frame(Var2 = c("myocardial infarction drug therapy", 
                                  "imipramine poisoning",
                                  "oximetry",
                                  "thrombosis drug therapy",
                                  "angioedema chemically induced"),
                         stringsAsFactors = FALSE)
    
    # Create the example data frame: female_df
    female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                    "angioedema chemically induced, angioedema chemically induced, oximetry",
                                    "abo blood group system, imipramine poisoning, adverse effects",
                                    "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                    "thrombosis drug therapy"),
                            stringsAsFactors = FALSE)
    

    I also load the required packages. microbenchmark is a package for measuring code performance.

    # Load packages
    library(purrr)
    library(stringr)
    library(microbenchmark)
    

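    Function Combinations

    Below is the list of function combinations that can accomplish the OP's task.

    Combination 1

    This is from Luís Telles's answer. It uses sapply and grepl.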

    sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
    
    myocardial infarction drug therapy               imipramine poisoning 
                                     1                                  1 
                              oximetry            thrombosis drug therapy 
                                     2                                  2 
         angioedema chemically induced 
                                     1
    

    Combination 2

    This is from Dave2e's answer. It uses sapply and grep.

    sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})
    
    myocardial infarction drug therapy               imipramine poisoning 
                                     1                                  1 
                              oximetry            thrombosis drug therapy 
                                     2                                  2 
         angioedema chemically induced 
                                     1
    

    Combination 3

    This uses map_int and str_detect.

    map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
    [1] 1 1 2 2 1
    

    Combination 4

    This uses map_int and str_which.

    map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
    [1] 1 1 2 2 1
    

    Combination 5

    This uses map_int and grepl.

    map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
    [1] 1 1 2 2 1
    

    Combination 6

    This uses map_int and grep.

    map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
    [1] 1 1 2 2 1
    

    Combination 7

    This uses sapply and str_detect.

    sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
    myocardial infarction drug therapy               imipramine poisoning 
                                     1                                  1 
                              oximetry            thrombosis drug therapy 
                                     2                                  2 
         angioedema chemically induced 
                                     1
    

    Combination 8

    This uses sapply and str_which.

    sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
    myocardial infarction drug therapy               imipramine poisoning 
                                     1                                  1 
                              oximetry            thrombosis drug therapy 
                                     2                                  2 
         angioedema chemically induced 
                                     1
    

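    All of these combinations are valid answers. For example, we can store the result of any of them by assigning it with female$Count <- followed by the expression.

    Microbenchmark

    Here I benchmarked the eight combinations, evaluating each one 30,000 times.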

    m <- microbenchmark(
      C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
      C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
      C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
      C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
      C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
      C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
      C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
      C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
      times = 30000L
    )
    
    print(m)
    
    Unit: microseconds
     expr     min      lq     mean   median       uq       max neval
       C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
       C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
       C3 189.238 231.575 1502.319  790.305 2386.309 146455.85 30000
       C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
       C5 150.107 185.388 1452.586 1970.630 2376.687  32124.08 30000
       C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
       C7 205.916 251.461 1516.979  851.246 2408.119 146305.10 30000
       C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000
    

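    All of these combinations have similar mean times on this five-row example, but Combination 3, which uses map_int and str_detect, has the lowest median.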
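
Every combination above still makes one regular-expression pass over all rows for each term. If each MH2 entry is a comma-separated list of complete phrases and a term only needs to match a whole phrase (both are assumptions about the data, not something the question states), the loop over terms can be dropped by splitting each row once and tabulating. A minimal base R sketch under those assumptions, with illustrative variable names:

```r
# Example data copied from the question
terms <- c("myocardial infarction drug therapy", "imipramine poisoning",
           "oximetry", "thrombosis drug therapy",
           "angioedema chemically induced")
mh2 <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
         "angioedema chemically induced, angioedema chemically induced, oximetry",
         "abo blood group system, imipramine poisoning, adverse effects",
         "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
         "thrombosis drug therapy")

# Split every row into its phrases once; unique() keeps a phrase once per row,
# so repeats inside a row are not counted twice
phrases_per_row <- lapply(strsplit(trimws(tolower(mh2)), "\\s*,\\s*"), unique)

# Tabulate only the terms of interest; the factor levels keep the term order
# and give a zero count for terms that never occur (terms must be unique)
counts <- table(factor(unlist(phrases_per_row), levels = tolower(terms)))

result <- data.frame(Var2 = terms, Count = as.integer(counts),
                     stringsAsFactors = FALSE)
```

Unlike the regex-based combinations, this counts exact phrase matches only, so a term that appears as part of a longer phrase is not counted.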

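Answer 1: (score: 2)

In the for loop from the question, repeatedly calling rbind to append one row at a time to the data frame is very expensive in terms of processing time.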
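
One way to avoid that cost while keeping the loop is to allocate the result vector once and fill it by index, then build the data frame a single time at the end. A minimal sketch on a subset of the question's example, with the pattern built as in the question's function (the names terms, mh2, and counts are illustrative):

```r
# A subset of the example data from the question
terms <- c("oximetry", "angioedema chemically induced", "thrombosis drug therapy")
mh2 <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
         "angioedema chemically induced, angioedema chemically induced, oximetry",
         "thrombosis drug therapy")

# Allocate the full result vector up front instead of growing it with rbind()
counts <- integer(length(terms))

for (i in seq_along(terms)) {
  pattern <- paste0("\\b", terms[i], "\\b")
  counts[i] <- sum(grepl(pattern, mh2, ignore.case = TRUE))
}

# Build the data frame once, after the loop has finished
df_female2 <- data.frame(word = terms, occurrences = counts,
                         stringsAsFactors = FALSE)
```

This only removes the overhead of growing the data frame; the pattern matching is still done once per term, which is probably where most of the time goes for 5,000+ terms.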

Here is a solution using the stringr package.

#Data set up
var2<-c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
             "thrombosis drug therapy", "angioedema chemically induced")
female<-data.frame(var2, stringsAsFactors = FALSE)

MH2<-c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
                "abo blood group system, imipramine poisoning, adverse effects",
                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                "thrombosis drug therapy")
female_df<-data.frame(MH2, stringsAsFactors = FALSE)

library(stringr)
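# Build a matrix with one column per term and one row per line checked;
# each cell holds how many times the term occurs in that line.
termmatrix<-sapply(female$var2, function(x){str_count(female_df$MH2, x)})
# Count the rows in which each term occurs at least once. A plain
# colSums(termmatrix) would also count repeats within a row, giving 2
# instead of 1 for "angioedema chemically induced".
ans<-colSums(termmatrix > 0)

In the end, ans is a named vector holding each term and the number of rows that contain it.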

Addendum
To avoid creating a huge term matrix, try:

ans<-sapply(female$var2, function(x){length(grep(x, female_df$MH2))})

This is a slight modification of Luís's answer.

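Answer 2: (score: 2)

A solution using only base R (assuming female$Var2 contains only unique strings):

# For each term, count the rows of female_df$MH2 that match it (case-insensitive)
counts <- sapply(female$Var2, function(x){ sum(grepl(pattern = x,
                                                     x = female_df$MH2,
                                                     ignore.case = TRUE))
                                     })
# Note: cbind() of a character and a numeric vector yields a character matrix
word_counts <- cbind(female$Var2, counts)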
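
Because the search terms are plain phrases rather than regular expressions, one further option is literal matching with fixed = TRUE, which skips the regex engine and is often faster on large inputs. grepl ignores ignore.case when fixed = TRUE, so the sketch below lower-cases both sides once beforehand. It matches substrings without the \\b word boundaries used in the question's function, so results can differ when a term occurs inside a longer word. The names terms, mh2_lower, and counts are illustrative.

```r
# Example data copied from the question
terms <- c("myocardial infarction drug therapy", "imipramine poisoning",
           "oximetry", "thrombosis drug therapy",
           "angioedema chemically induced")
mh2 <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
         "angioedema chemically induced, angioedema chemically induced, oximetry",
         "abo blood group system, imipramine poisoning, adverse effects",
         "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
         "thrombosis drug therapy")

# Lower-case the text once, not once per term
mh2_lower <- tolower(mh2)

counts <- vapply(tolower(terms), function(term) {
  # fixed = TRUE treats the term as a literal string, not a regular expression
  sum(grepl(term, mh2_lower, fixed = TRUE))
}, integer(1), USE.NAMES = FALSE)

# A data frame keeps the counts numeric, unlike cbind() on mixed types
word_counts <- data.frame(Var2 = terms, counts = counts,
                          stringsAsFactors = FALSE)
```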