R: search strings by keyword and check whether a date falls between a start date and an end date

Date: 2018-05-09 22:38:21

Tags: r string date filter text-mining

I have a set of sentences:

{ cat ate rat, rat was killed, cat killed the rat, rat killed by rat}

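First, I want to check whether the value in column Col2 contains any of these sentences.

Second, if there is a match, I want to check whether the date in Col3 falls between the start date in Col4 and the end date in Col5.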

Here is a test dataset:

Id      Col2                Col3        Col4        Col5
1       This cat            05-09-2001  04-10-2000  09-14-2001
2       This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3       Cat was killed      02-04-2015  02-01-2015  03-12-2015
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015

Here is the expected output:

Id      Col2                Col3        Col4        Col5         Event
1       This cat            05-09-2001  04-10-2000  09-14-2001   No
2       Cat ate rat         05-04-2011  05-01-2011  05-14-2011   Yes
3       Cat died            02-04-2015  02-01-2015  03-12-2015   No
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014   Yes
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015   No

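Here is what I have done so far. The code below runs and gives me the result I want, but it is very inefficient: it is slow and takes a long time. With a df of 3 million rows in particular, it would take about 10 days to finish running. Any suggestions for a more efficient way to solve this are much appreciated.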

keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

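# Default every row to "No" so that rows without a keyword match are covered
Df$Event <- "No"

for (i in 1:NROW(Df)) {
  # Does Col2 in row i contain any of the keywords?
  if (NROW(Df[grep(paste0(keywords, collapse = "|"), Df$Col2[i]), ]) > 0) {
    # Assign to row i only; Df$Event <- "Yes" would overwrite the whole column
    if ((Df$Col3[i] > Df$Col4[i]) & (Df$Col3[i] < Df$Col5[i])) {
      Df$Event[i] <- "Yes"
    } else {
      Df$Event[i] <- "No"
    }
  }
  print(i)
}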

2 Answers:

Answer 0 (score: 0)

Basically, you need to test three conditions:

  • Col3 >= Col4
  • Col3 <= Col5
  • Col2 appears in keywords

Use vectorized functions such as ifelse and %in% to speed up your code.

mydf <- structure(list(Id = 1:5, Col2 = c("This cat", "This cat ate a rat", 
"Cat was killed", "Cat killed the rat", "Rat ran away"), Col3 = structure(c(11451, 
15098, 16470, 16349, 13950), class = "Date"), Col4 = structure(c(11057, 
15095, 16467, 16333, 16537), class = "Date"), Col5 = structure(c(11579, 
15108, 16506, 16354, 16545), class = "Date")), .Names = c("Id", 
"Col2", "Col3", "Col4", "Col5"), row.names = c(NA, -5L), class = "data.frame")

keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5) 
                      & mydf$Col2 %in% keywords, "Yes", "No")

Note that this version is case-sensitive. You may be interested in functions such as tolower:

mydf$event <- ifelse((mydf$Col3 >= mydf$Col4) & (mydf$Col3 <= mydf$Col5) 
                     & tolower(mydf$Col2) %in% keywords, "Yes", "No")
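
The question asks whether Col2 contains one of the sentences, whereas %in% tests for exact equality. A minimal sketch of a substring-based variant follows. It assumes that case-insensitive matching is wanted and that the keywords contain no regular-expression metacharacters; the example data is rebuilt from the question's table so that the block runs on its own, and the names pattern, has_keyword and in_range are illustrative only.

```r
# Rebuild the example data so that this block is self-contained
mydf <- data.frame(
  Id   = 1:5,
  Col2 = c("This cat", "This cat ate a rat", "Cat was killed",
           "Cat killed the rat", "Rat ran away"),
  Col3 = as.Date(c("05-09-2001", "05-04-2011", "02-04-2015", "10-06-2014", "03-12-2008"),
                 format = "%m-%d-%Y"),
  Col4 = as.Date(c("04-10-2000", "05-01-2011", "02-01-2015", "09-20-2014", "04-12-2015"),
                 format = "%m-%d-%Y"),
  Col5 = as.Date(c("09-14-2001", "05-14-2011", "03-12-2015", "10-11-2014", "04-20-2015"),
                 format = "%m-%d-%Y"),
  stringsAsFactors = FALSE
)
keywords <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# One regular expression that matches any of the keywords
pattern <- paste(keywords, collapse = "|")

# grepl() is vectorized over Col2, so no loop is needed
has_keyword <- grepl(pattern, mydf$Col2, ignore.case = TRUE)
in_range <- mydf$Col3 >= mydf$Col4 & mydf$Col3 <= mydf$Col5

mydf$event <- ifelse(has_keyword & in_range, "Yes", "No")
```

On this data only row 4 comes out as "Yes". Row 2 ("This cat ate a rat") still does not match, because the literal text "cat ate rat" does not occur in it; looser matching would need a different pattern.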

Answer 1 (score: 0)

Short answer:

df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)

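This does the keyword lookup that you are trying to do in the for loop.

In R, you should avoid for loops where possible and try to use the apply family of functions. tolower converts the contents of df$Col2 to lower case. The function function(el) el %in% sentences is then applied to each element of this column vector: it asks whether the element is a member of the character vector sentences. The Boolean results are first collected into a list, which sapply then tries to simplify into a vector (that is the "s" in sapply).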
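
Since %in% is itself vectorized, the sapply wrapper is not strictly required. A small sketch, using a hypothetical col2 vector as a stand-in for df$Col2, suggests that both forms give the same logical values:

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# Hypothetical stand-in for df$Col2
col2 <- c("This cat", "Cat killed the rat", "Rat ran away")

# Element-by-element lookup, as in the short answer
via_sapply <- sapply(tolower(col2), function(el) el %in% sentences)

# %in% already works on whole vectors, so it can be applied directly
direct <- tolower(col2) %in% sentences

# sapply() attaches names to its result; drop them before comparing
same <- identical(unname(via_sapply), direct)
```

The direct form avoids one function call per element, which may matter on a data frame with millions of rows.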

Complete working code:

Reading in and preparing the data

sentences <- unlist(strsplit("cat ate rat, rat was killed, cat killed the rat, rat killed by rat",", "))

This just converts your given text into a data frame:

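txt2df <- function(dfstr) {
  # Split the text into lines, then split each line on runs of two or more spaces
  lines <- unlist(strsplit(dfstr, "\n"))
  l <- strsplit(lines, " {2,}")
  # The first line holds the column names; the remaining lines are the rows
  df <- as.data.frame(do.call(rbind, l[-1]), stringsAsFactors = FALSE)
  colnames(df) <- l[[1]]
  df
}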

Apply the function to the multi-line string to get a data.frame:

df <- txt2df("Id      Col2                Col3        Col4        Col5
1       This cat            05-09-2001  04-10-2000  09-14-2001
2       This cat ate a rat  05-04-2011  05-01-2011  05-14-2011
3       Cat was killed      02-04-2015  02-01-2015  03-12-2015
4       Cat killed the rat  10-06-2014  09-20-2014  10-11-2014
5       Rat ran away        03-12-2008  04-12-2015  04-20-2015")


df

  Id               Col2       Col3       Col4       Col5
1  1           This cat 05-09-2001 04-10-2000 09-14-2001
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015

Lookup

Check whether the lower-cased value of df$Col2 is one of the sentences:

df$Event <- sapply(tolower(df$Col2), function(el) el %in% sentences)

Result

df

  Id               Col2       Col3       Col4       Col5 Event
1  1           This cat 05-09-2001 04-10-2000 09-14-2001 FALSE
2  2 This cat ate a rat 05-04-2011 05-01-2011 05-14-2011 FALSE
3  3     Cat was killed 02-04-2015 02-01-2015 03-12-2015 FALSE
4  4 Cat killed the rat 10-06-2014 09-20-2014 10-11-2014  TRUE
5  5       Rat ran away 03-12-2008 04-12-2015 04-20-2015 FALSE
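
The lookup above covers only the keyword match. The date condition from the question could be added along the following lines. This is a sketch that assumes the date columns are still character strings in month-day-year order, as they are after reading the text table; it uses a small hand-built excerpt of the data under the hypothetical name df_demo so that it runs on its own.

```r
sentences <- c("cat ate rat", "rat was killed", "cat killed the rat", "rat killed by rat")

# Hypothetical two-row excerpt of the example data, with dates kept as strings
df_demo <- data.frame(
  Col2 = c("Cat killed the rat", "Rat ran away"),
  Col3 = c("10-06-2014", "03-12-2008"),
  Col4 = c("09-20-2014", "04-12-2015"),
  Col5 = c("10-11-2014", "04-20-2015"),
  stringsAsFactors = FALSE
)

# Convert the strings to Date objects so that comparisons are chronological
date_cols <- c("Col3", "Col4", "Col5")
df_demo[date_cols] <- lapply(df_demo[date_cols], as.Date, format = "%m-%d-%Y")

# Keyword match and date-range check, both vectorized
matched <- tolower(df_demo$Col2) %in% sentences
in_range <- df_demo$Col3 >= df_demo$Col4 & df_demo$Col3 <= df_demo$Col5

df_demo$Event <- ifelse(matched & in_range, "Yes", "No")
```

Comparing the raw strings would order them character by character rather than chronologically, which is why the conversion with as.Date comes first.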