How to extract the line above a match using regex and R?

Asked: 2018-08-02 10:59:06

Tags: r regex

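I would like to match a specific string using R and keep only the line above that match. Here is some sample data; the real file contains hundreds of similar cases: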

first_case<- data.frame(line = 

             c("#John Wayne: Su, 11.01.2013 08:24:42#
                He is present / I guess, Does great job
                --------------------------------------------------
                #Michal Thorn: Fr, 12.09.2015 17:23:01#
                Works quite frequently with people
                --------------------------------------------------
                #Sandra Nunes: Mo, 20.05.2011 09:00:29#
                She has some new clients"))



second_case<- data.frame(line = 

                c("#Boris Jonson: Mo, 30.09.2017 09:20:42#
                He is present
                --------------------------------------------------
                #Jacky Fine: Th, 02.02.2013 18:23:01#
                Does great job
                --------------------------------------------------
                #Michael Bissping: Mo, 25.03.2012 10:00:29#
                Hard to count on"))



third_case<- data.frame(line = 

              c("#Isabelle Warren: Sa, 02.12.2013 02:24:42#
                 Not around / anymore
               --------------------------------------------------
                 #Tobias Maker: Mo, 02.03.2013 10:23:01#
                 Works quite frequently with people
               --------------------------------------------------
                 #Toe Michael : Mo, 20.05.2011 09:00:29#
                 She has some new clients & Does great job"))

all_cases <- rbind(first_case,second_case,third_case)

Here I tried to filter for the line located one line above:

Does great job

by checking whether Does great job follows a line ending in a newline and taking the first line above it:

dplyr::filter(all_cases, grepl("((.*\n){1})Does great job",line))

Expected result:

first_case<- data.frame(line = 
                      c("#John Wayne: Su, 11.01.2013 08:24:42#"))
second_case<- data.frame(line = 
                       c("#Jacky Fine: Th, 02.02.2013 18:23:01#"))
third_case<- data.frame(line = 
                      c("#Toe Michael : Mo, 20.05.2011 09:00:29#"))

expected_result <- rbind(first_case,second_case,third_case)

1   #John Wayne: Su, 11.01.2013 08:24:42#
2   #Jacky Fine: Th, 02.02.2013 18:23:01#
3   #Toe Michael : Mo, 20.05.2011 09:00:29#

Unfortunately, this returns zero rows. Any insight is appreciated!

3 Answers:

Answer 0 (score: 3)

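Here is a base R approach using strsplit. We can form a list/vector of the lines and then use grep directly to find the index of the line(s) matching Does great job. Then we simply return the line immediately before each match.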

line <- "#Boris Jonson: Mo, 30.09.2017 09:20:42#
         He is present
         --------------------------------------------------
         #Jacky Fine: Th, 02.02.2013 18:23:01#
         Does great job
         --------------------------------------------------
         #Michael Bissping: Mo, 25.03.2012 10:00:29#
         Hard to count on"

terms <- unlist(strsplit(line, "\n"))
terms[grep("Does great job", terms) - 1]

[1] "                #Jacky Fine: Th, 02.02.2013 18:23:01#"

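My answer leaves many edge cases uncovered, the first being the matching logic. What if the search term matches multiple times, or not at all? Also, how specific should the pattern used in grep be?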
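
As a rough illustration of those edge cases, here is a minimal base R sketch. The helper name lines_above and its arguments are hypothetical (they do not appear in the answer above), and it assumes that a literal, case-sensitive match on the search term is what is wanted.

```r
# Hypothetical helper: return the line(s) directly above every line that
# contains `pattern`. The name and arguments are illustrative only.
lines_above <- function(text, pattern, fixed = TRUE) {
  # Split into lines and drop the indentation that the sample data carries.
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  # Indices of all matching lines (possibly several, possibly none).
  hits <- grep(pattern, lines, fixed = fixed)
  # A match on the very first line has nothing above it. R would silently
  # drop index 0 anyway; the filter just makes that intent explicit.
  hits <- hits[hits > 1]
  # Yields character(0) when nothing matched.
  lines[hits - 1]
}

example <- "#Jacky Fine: Th, 02.02.2013 18:23:01#
            Does great job
            --------------------------------------------------
            #Michael Bissping: Mo, 25.03.2012 10:00:29#
            Hard to count on"

lines_above(example, "Does great job")
```

To run it over a whole column, something like lapply(as.character(all_cases$line), lines_above, pattern = "Does great job") should work; the as.character call guards against the column being a factor.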

Answer 1 (score: 3)

You can try:

library(stringr)
library(dplyr)

all_cases %>% transmute(x=str_extract(line,".*(?=\n.*?Does great job)"))

#                                                         x
#1                    #John Wayne: Su, 11.01.2013 08:24:42#
#2                    #Jacky Fine: Th, 02.02.2013 18:23:01#
#3                  #Toe Michael : Mo, 20.05.2011 09:00:29#

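An improved solution that treats each of the three people in every row independently:

library(tidyr)  # provides separate() and gather()

all_cases %>% separate(line,c("a","b","c"),sep="-{3,}") %>%
  gather(k,v,a,b,c) %>%
  transmute(x=str_extract(v,".*(?=\n.*?Does great job)")) %>%
  filter(!is.na(x))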
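
str_extract returns only the first match per row, which is presumably why the improved version splits each row first. As a hedged alternative, the sketch below collects every match per row in base R with gregexpr and a Perl-style lookahead. The sample rows are made up for illustration, and the pattern uses [^\n]+ instead of .* so that empty matches cannot occur.

```r
# Illustrative data: the first row has two matching entries, the second none.
rows <- c(
  "#A: Mo#\nDoes great job\n-----\n#B: Tu#\nHard to count on\n-----\n#C: We#\nnew clients & Does great job",
  "#D: Th#\nHe is present"
)

# A whole line, followed (lookahead) by a newline and a line that contains
# the search term. perl = TRUE is needed for the (?= ) lookahead.
pattern <- "[^\n]+(?=\n[^\n]*Does great job)"

# One character vector of matches per row; empty when a row has no match.
hits <- regmatches(rows, gregexpr(pattern, rows, perl = TRUE))

# Flatten and strip any indentation captured along with the line.
above_all <- trimws(unlist(hits))
above_all
```

Keeping hits as a list preserves which row each extracted line came from; unlist is only a convenience when that link is not needed.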

Answer 2 (score: 1)

Try the following pattern: (.+)\n(.*[dD]oes great job.*). You will need the first capture group, \1.

Note: I assumed that . does not match \n.
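
One way this pattern might be applied in R is sketched below, using regexec and regmatches from base R. The sample string is a shortened, unindented version of the question's data. perl = TRUE is set explicitly because, in the Perl-compatible engine, . does not match a newline unless (?s) is given, which is the assumption stated in the note; R's default engine may behave differently.

```r
x <- "#Boris Jonson: Mo, 30.09.2017 09:20:42#\nHe is present\n-----\n#Jacky Fine: Th, 02.02.2013 18:23:01#\nDoes great job\n-----\n#Michael Bissping: Mo, 25.03.2012 10:00:29#\nHard to count on"

pattern <- "(.+)\n(.*[dD]oes great job.*)"

# regexec returns the full match plus one entry per capture group.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# m[1] is the whole match, m[2] is capture group 1 (the line above),
# m[3] is capture group 2 (the line containing the search term).
above <- m[2]
above
```

If the real data is indented like the question's sample, wrapping the result in trimws() would remove the leading spaces that (.+) captures.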