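I have a variable in a data set (female$Var2) that contains phrases I want to use in a string search. I want to find the number of rows in which each phrase appears in another data frame (female_df$MH2). For example, female$Var2 looks like: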
myocardial infarction drug therapy
imipramine poisoning
oximetry
thrombosis drug therapy
angioedema chemically induced
I want to find the number of rows in female_df$MH2 that contain each of the phrases above. female_df$MH2 looks like this:
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy
So my output should look like this:
myocardial infarction drug therapy 1
imipramine poisoning 1
oximetry 2
thrombosis drug therapy 2
angioedema chemically induced 1
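Note that this is not the total number of occurrences (see "angioedema chemically induced", which appears twice in one row but is counted once). It is the number of rows that contain the phrase. I am currently running a for loop, which searches for more than 5,000 terms across more than 428,000 rows. When I try to vectorize the call with occurrences_female(female$Var2), I get the warning
In grepl(word, female_df$MH2, ignore.case = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
and only the result for the first element of female$Var2 is returned.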
Here is the for loop I am running:
for (i in 1:nrow(female))
{
  word <- female$Var2[i]
  df_female <- data.frame(word, occurrences_female(word))
  df_female2 <- rbind(df_female2, df_female)
}
It is based on this function:
occurrences_female <- function(word)
{
  # inserts \\b in the beginning
  word <- paste0("\\b", word)
  # inserts \\b at the end
  n <- nchar(word)
  word <- paste(substr(word, 1, n), "\\b", sep = "")
  occurrences <- sum(grepl(word, female_df$MH2, ignore.case = TRUE))
  return (occurrences)
}
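The function works when I run it by hand, but I need to run it on more than 5,000 terms and the for loop is too slow (it has been running for over 2 hours). I don't know how to search one variable of a data frame for the values held in a variable of a different data frame.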
Answer 0: (score: 4)
We can use the following code to accomplish the task. Benchmarking shows that it performs well.
library(purrr)
library(stringr)
female$Count <- map_int(female$Var2,
                        function(x){sum(str_detect(female_df$MH2, pattern = x))})
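The one-liner above passes each phrase to str_detect as a plain regular expression. The function in the question also wraps the phrase in word boundaries and ignores case. A minimal sketch of carrying both over to stringr follows, assuming the phrases contain no regex metacharacters. The helper name count_rows is hypothetical, and the data are the example from the question.

```r
library(purrr)
library(stringr)

# Example data copied from the question
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: number of rows of `text` that contain `phrase`
# as a whole-word, case-insensitive match.
count_rows <- function(phrase, text) {
  pattern <- regex(paste0("\\b", phrase, "\\b"), ignore_case = TRUE)
  sum(str_detect(text, pattern))
}

female$Count <- map_int(female$Var2, count_rows, text = female_df$MH2)
print(female)
```

On the example data this gives the counts 1, 1, 2, 2, 1 requested in the question.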
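There are several ways to count the number of rows containing each word or phrase. Based on the answers and discussion in this thread so far, the general strategy has two parts:
1. Loop over the phrases with lapply or sapply, or with the map functions from the purrr package.
2. Test each phrase against the text with grep or grepl, or with str_which or str_detect from the stringr package.
Since the OP needs to process a large amount of data, I ran an analysis to compare which combinations of functions from base R, purrr, and stringr complete the same task in the least time.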
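One variation that the comparison in this answer does not cover is literal matching. The phrases here are plain text rather than regular expressions, so grepl can be called with fixed = TRUE, which skips the regex engine and may be faster on large inputs. The sketch below is an assumption-laden illustration in base R: the helper name count_rows_fixed is hypothetical, case is handled with tolower because fixed = TRUE does not support ignore.case, and there is no word-boundary check, so a phrase would also match inside a longer word.

```r
# Example data copied from the question
phrases <- c("myocardial infarction drug therapy",
             "imipramine poisoning",
             "oximetry",
             "thrombosis drug therapy",
             "angioedema chemically induced")
text <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
          "angioedema chemically induced, angioedema chemically induced, oximetry",
          "abo blood group system, imipramine poisoning, adverse effects",
          "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
          "thrombosis drug therapy")

# Hypothetical helper: rows of `text` containing each phrase as a literal,
# case-insensitive substring (no regex, no word boundaries).
count_rows_fixed <- function(phrases, text) {
  text_lower <- tolower(text)  # lower-case the text once, outside the loop
  counts <- vapply(
    tolower(phrases),
    function(p) sum(grepl(p, text_lower, fixed = TRUE)),
    integer(1),
    USE.NAMES = FALSE
  )
  names(counts) <- phrases
  counts
}

counts <- count_rows_fixed(phrases, text)
print(counts)
```

Whether this is actually faster on the OP's 428,000 rows would need to be measured on that data.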
I investigated eight combinations in total. The choices are sapply or map_int for the loop, and grep, str_which, grepl, or str_detect for the match.
Here I created two data frames, female and female_df, based on the OP's example. Note that I set stringsAsFactors = FALSE so that every column is stored as character.
# Create the example data frame: female
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)
# Create the example data frame: female_df
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)
I also loaded the required packages. microbenchmark is a package for measuring the performance of code.
# Load packages
library(purrr)
library(stringr)
library(microbenchmark)
Below is the list of function combinations that can accomplish the OP's task.
This one is from Luís Telles's answer. It uses sapply and grepl.
sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
This one is from Dave2e's answer. It uses sapply and grep.
sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
This one uses map_int and str_detect.
map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
This one uses map_int and str_which.
map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
This one uses map_int and grepl.
map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
[1] 1 1 2 2 1
This one uses map_int and grep.
map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
[1] 1 1 2 2 1
This one uses sapply and str_detect.
sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
This one uses sapply and str_which.
sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning
                                 1                                  1
                          oximetry            thrombosis drug therapy
                                 2                                  2
     angioedema chemically induced
                                 1
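All of these combinations are valid answers. For example, we can store the result of any of them by assigning it with female$Count <- followed by the call.
Here I benchmarked the eight combinations with 30,000 samples each.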
m <- microbenchmark(
  C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  times = 30000L
)
print(m)
Unit: microseconds
expr min lq mean median uq max neval
C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
C3 189.238 231.575 1502.319 790.305 2386.309 146455.85 30000
C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
C5 150.107 185.388 1452.586 1970.630 2376.687 32124.08 30000
C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
C7 205.916 251.461 1516.979 851.246 2408.119 146305.10 30000
C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000
All of these combinations have similar mean times, but combination 3 (C3), which uses map_int and str_detect, has the lowest median.
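The benchmark above was run on the five-row example, so the ranking may not carry over to the OP's 428,000 rows. A rough sketch of how one might check this on a larger input follows, using only base R. The replication factor of 2000 is an arbitrary assumption, the repeated rows are not representative of real data, and only two of the eight combinations are timed.

```r
# Example data copied from the question
phrases <- c("myocardial infarction drug therapy",
             "imipramine poisoning",
             "oximetry",
             "thrombosis drug therapy",
             "angioedema chemically induced")
text <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
          "angioedema chemically induced, angioedema chemically induced, oximetry",
          "abo blood group system, imipramine poisoning, adverse effects",
          "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
          "thrombosis drug therapy")

# Inflate the text to 10,000 rows by repetition (arbitrary factor)
big_text <- rep(text, times = 2000)

# Time the grepl-based count
time_grepl <- system.time(
  counts_grepl <- vapply(phrases, function(p) sum(grepl(p, big_text)), integer(1))
)

# Time the grep-based count
time_grep <- system.time(
  counts_grep <- vapply(phrases, function(p) length(grep(p, big_text)), integer(1))
)

# Elapsed seconds for each approach; values depend on the machine
print(c(grepl = time_grepl[["elapsed"]], grep = time_grep[["elapsed"]]))
```

Both approaches must return the same counts, so comparing counts_grepl with counts_grep is a quick sanity check before reading the timings.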
Answer 1: (score: 2)
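In your solution above, repeatedly using rbind to append each row to the data frame is very expensive in terms of processing time.
Here is a solution using the stringr package.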
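To illustrate the rbind point in isolation, here is a minimal sketch of the question's loop rewritten to fill a preallocated vector and build the data frame once at the end. It keeps the question's occurrences_female logic (condensed into one paste0 call) and uses the example data. It removes only the rbind overhead, and the per-term search is unchanged.

```r
# Example data copied from the question
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)

# Same logic as the question's function: whole-word, case-insensitive match
occurrences_female <- function(word) {
  pattern <- paste0("\\b", word, "\\b")
  sum(grepl(pattern, female_df$MH2, ignore.case = TRUE))
}

# Preallocate the result vector instead of growing a data frame with rbind
counts <- integer(nrow(female))
for (i in seq_len(nrow(female))) {
  counts[i] <- occurrences_female(female$Var2[i])
}

# Build the data frame once, after the loop
df_female2 <- data.frame(word = female$Var2,
                         occurrences = counts,
                         stringsAsFactors = FALSE)
print(df_female2)
```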
#Data set up
var2<-c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
"thrombosis drug therapy", "angioedema chemically induced")
female<-data.frame(var2, stringsAsFactors = FALSE)
MH2<-c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
"abo blood group system, imipramine poisoning, adverse effects",
"isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
"thrombosis drug therapy")
female_df<-data.frame(MH2, stringsAsFactors = FALSE)
library(stringr)
#create a matrix where columns is the terms
# and the rows are the lines checked.
termmatrix<-sapply(female$var2, function(x){str_count(female_df$MH2, x)})
#find the sums of the columns to determine the number of times each term is used
ans<-colSums(termmatrix)
The final ans is a named vector of the terms and their total counts.
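Note that str_count counts every occurrence, so a phrase that is repeated within one row contributes more than once to the column sum. The question asks for the number of rows instead. A small sketch of the difference follows, using the example data; the variable names total_occurrences and rows_with_term are made up for illustration.

```r
library(stringr)

# Example data copied from the question
var2 <- c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
          "thrombosis drug therapy", "angioedema chemically induced")
MH2 <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
         "angioedema chemically induced, angioedema chemically induced, oximetry",
         "abo blood group system, imipramine poisoning, adverse effects",
         "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
         "thrombosis drug therapy")

# Rows are the lines checked, columns are the terms
termmatrix <- sapply(var2, function(x) str_count(MH2, x))

# Total occurrences: "angioedema chemically induced" is counted twice
total_occurrences <- colSums(termmatrix)

# Rows containing the term at least once: each row counts at most once
rows_with_term <- colSums(termmatrix > 0)

print(rbind(total_occurrences, rows_with_term))
```

On the example data the two results differ only for "angioedema chemically induced" (2 occurrences, 1 row).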
Addition
To avoid creating a huge term matrix, try:
ans<-sapply(female$var2, function(x){length(grep(x, female_df$MH2))})
This is a slight modification of Luís's answer.
Answer 2: (score: 2)
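A solution using only base R (assuming your female$Var2 contains only unique strings):
counts <- sapply(female$Var2, function(x) {
  # number of rows of female_df$MH2 that contain phrase x, ignoring case
  sum(grepl(pattern = x, x = female_df$MH2, ignore.case = TRUE))
})
word_counts <- cbind(female$Var2, counts)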
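One detail worth knowing about the last line: cbind on a character vector and a numeric vector produces a character matrix, so the counts end up stored as strings. If numeric counts are needed later, a data frame keeps the two columns in their own types. The sketch below assumes the example data from the question, with the column named Var2 as in the question; the column names term and count are made up for illustration.

```r
# Example data copied from the question
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)

# vapply enforces that every phrase yields exactly one integer
counts <- vapply(female$Var2,
                 function(x) sum(grepl(x, female_df$MH2, ignore.case = TRUE)),
                 integer(1))

# A data frame keeps the terms as character and the counts as integer
word_counts <- data.frame(term = female$Var2,
                          count = unname(counts),
                          stringsAsFactors = FALSE)
print(word_counts)
```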