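I have a set of sentences:
{ cat ate rat, rat was killed, cat killed the rat, rat killed by rat }
First, I want to check whether the value in column Col2 contains any of these sentences.
Second, if there is a match, I want to check whether the date in Col3 falls between the start date in Col4 and the end date in Col5.
Here is a test dataset: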
Id Col2 Col3 Col4 Col5
1 This cat 05-09-2001 04-10-2000 09-14-2001
2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011
3 Cat was killed 02-04-2015 02-01-2015 03-12-2015
4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014
5 Rat ran away 03-12-2008 04-12-2015 04-20-2015
Here is the expected output:
Id Col2 Col3 Col4 Col5 Event
1 This cat 05-09-2001 04-10-2000 09-14-2001 No
2 Cat ate rat 05-04-2011 05-01-2011 05-14-2011 Yes
3 Cat died 02-04-2015 02-01-2015 03-12-2015 No
4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014 Yes
5 Rat ran away 03-12-2008 04-12-2015 04-20-2015 No
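This is what I have done so far. The code below runs and gives me the result I want, but it is very inefficient: it is slow and takes a long time. With a df of 3 million rows, it would take about 10 days to finish. Any suggestions for a more efficient way to solve this are much appreciated.
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")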
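Df$Event <- "No"  # default, so rows without a keyword match are also filled
for (i in 1:NROW(Df)) {
  # Does row i of Col2 contain any of the keywords?
  if (NROW(Df[grep(paste0(keywords, collapse = "|"), Df$Col2[i]), ]) > 0) {
    # Is Col3 strictly between Col4 and Col5?
    if ((Df$Col3[i] > Df$Col4[i]) & (Df$Col3[i] < Df$Col5[i])) {
      Df$Event[i] <- "Yes"  # assign to row i only, not the whole column
    } else {
      Df$Event[i] <- "No"
    }
  }
  print(i)
}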
Answer 0 (score: 0)
Basically, you need to test three conditions: Col3 >= Col4, Col3 <= Col5, and Col2 appears in keywords. Use vectorized functions such as ifelse or %in% to speed up your code.
mydf <- structure(list(Id = 1:5, Col2 = c("This cat", "This cat ate a rat",
"Cat was killed", "Cat killed the rat", "Rat ran away"), Col3 = structure(c(11451,
15098, 16470, 16349, 13950), class = "Date"), Col4 = structure(c(11057,
15095, 16467, 16333, 16537), class = "Date"), Col5 = structure(c(11579,
15108, 16506, 16354, 16545), class = "Date")), .Names = c("Id",
"Col2", "Col3", "Col4", "Col5"), row.names = c(NA, -5L), class = "data.frame")
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5)
& mydf$Col2 %in% keywords, "Yes", "No")
Note that this version is case-sensitive. You may be interested in functions such as tolower.
mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5)
& tolower(mydf$Col2) %in% keywords, "Yes", "No")
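One caveat: %in% tests whole-string equality, whereas the question asks whether Col2 contains one of the sentences. Below is a minimal sketch of a vectorized substring match with grepl. It assumes the keywords contain no regex metacharacters and that case should be ignored. The data frame is rebuilt from the question's test data, and the names to_date, has_keyword and in_window are only illustrative.

```r
# Test data rebuilt from the question (dates are month-day-year)
to_date <- function(x) as.Date(x, format = "%m-%d-%Y")  # illustrative helper
Df <- data.frame(
  Id   = 1:5,
  Col2 = c("This cat", "This cat ate a rat", "Cat was killed",
           "Cat killed the rat", "Rat ran away"),
  Col3 = to_date(c("05-09-2001", "05-04-2011", "02-04-2015", "10-06-2014", "03-12-2008")),
  Col4 = to_date(c("04-10-2000", "05-01-2011", "02-01-2015", "09-20-2014", "04-12-2015")),
  Col5 = to_date(c("09-14-2001", "05-14-2011", "03-12-2015", "10-11-2014", "04-20-2015")),
  stringsAsFactors = FALSE
)
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# One alternation pattern, matched against the whole column in a single call
pattern     <- paste(keywords, collapse = "|")
has_keyword <- grepl(pattern, Df$Col2, ignore.case = TRUE)

# Vectorized date-window check (inclusive bounds, as in the answer above)
in_window <- Df$Col3 >= Df$Col4 & Df$Col3 <= Df$Col5

Df$Event <- ifelse(has_keyword & in_window, "Yes", "No")
print(Df)
```

On this test data only row 4 gets "Yes": "This cat ate a rat" does not literally contain "cat ate rat", so row 2 stays "No" unless the keywords are loosened.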
Answer 1 (score: 0)
Short answer:
df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)
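This does the sentence lookup from your for loop.
In R, you should avoid for loops where you can and try the apply-family functions instead.
tolower converts the contents of df$Col2 to lowercase.
The anonymous function function(el) el %in% sentences is then applied to each element of that column vector: it asks whether the element is a member of the character vector sentences. The Boolean results are first collected into a list, which sapply then tries to simplify into a vector (that is the s in sapply).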
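As a side note, %in% is itself vectorized, so the sapply wrapper is not strictly needed. A small sketch comparing the two forms on a made-up two-element vector (col2 here is illustrative, not the full data):

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")
col2 <- c("This cat", "Cat killed the rat")  # illustrative sample values

# Element-by-element lookup; sapply names the result after the input strings
via_sapply <- sapply(tolower(col2), function(el) el %in% sentences)

# Same lookup in a single vectorized call; the result is an unnamed logical vector
direct <- tolower(col2) %in% sentences

print(via_sapply)
print(direct)
```

Apart from the names that sapply attaches, both give the same logical vector, and the direct form avoids one function call per row.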
Full working version of the code:
Reading in and preparing the data
sentences <- unlist(strsplit("cat ate rat, rat was killed, cat killed the rat, rat killed by rat",", "))
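This just turns your given text into a data frame:
txt2df <- function(dfstr) {
  # One element per line of the input string
  lines <- unlist(strsplit(dfstr, "\n"))
  # Split each line into fields wherever two or more spaces occur
  l <- unlist(lapply(lines, strsplit, " {2,}"), recursive = FALSE)
  # Stack the data lines into a data frame; the first line holds the column names
  df <- as.data.frame(Reduce(rbind, l[2:length(l)]), row.names = FALSE)
  colnames(df) <- l[[1]]
  df
}
Apply the function to the multi-line string to get the data.frame (fields must be separated by at least two spaces):
df <- txt2df("Id  Col2  Col3  Col4  Col5
1  This cat  05-09-2001  04-10-2000  09-14-2001
2  This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3  Cat was killed  02-04-2015  02-01-2015  03-12-2015
4  Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5  Rat ran away  03-12-2008  04-12-2015  04-20-2015")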
df
Id Col2 Col3 Col4 Col5
1 1 This cat 05-09-2001 04-10-2000 09-14-2001
2 2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011
3 3 Cat was killed 02-04-2015 02-01-2015 03-12-2015
4 4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014
5 5 Rat ran away 03-12-2008 04-12-2015 04-20-2015
Lookup
Check whether the lowercased df$Col2 value is one of the sentences:
df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)
Result
df
Id Col2 Col3 Col4 Col5 Event
1 1 This cat 05-09-2001 04-10-2000 09-14-2001 FALSE
2 2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011 FALSE
3 3 Cat was killed 02-04-2015 02-01-2015 03-12-2015 FALSE
4 4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014 TRUE
5 5 Rat ran away 03-12-2008 04-12-2015 04-20-2015 FALSE
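The Event column above reflects only the sentence lookup. To add the date check from the question, the date columns would first need converting, since txt2df leaves them as character strings. A short sketch of why, assuming the dates are month-day-year as in the test data (to_date and the sample vectors are illustrative):

```r
# Two sample values taken from Col3 and Col4 of the test data (rows 5 and 4)
col3 <- c("03-12-2008", "10-06-2014")
col4 <- c("02-01-2015", "09-20-2014")

# Comparing the raw strings goes character by character, so the month decides
as_text <- col3 > col4

# Parse as month-day-year first, then compare real dates
to_date  <- function(x) as.Date(x, format = "%m-%d-%Y")  # illustrative helper
as_dates <- to_date(col3) >= to_date(col4)

print(as_text)
print(as_dates)
```

The string comparison reports the 2008 date as later than the 2015 one, while the Date comparison does not, so the window test is only meaningful after conversion.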