Error in regex pattern matching when splitting text into a two-column data frame

Time: 2017-09-24 08:16:05

Tags: r regex perl dataframe

Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

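Notice that there is a ":" in a different place in each text. For example:

  • In 'x', the ":" comes after the first sentence.
  • In 'y', the ":" comes after the fourth sentence.
  • In 'z', it comes after the sixth sentence.
  • In addition, there is a ":" before the last sentence of each text.

What I want to do is create two columns:

  • Only the first ":" is considered, not the last one.
  • If there is a ":" within the first three sentences, split the whole text into two columns; otherwise, keep all of the text in the second column and put 'NA' in the first column.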

Desired output for 'x':

 Col1                                                        Col2 
 There is a horror movie running in the iNox theater.        If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Desired output for 'y' (NA in Col1 because no ":" is found within the first three sentences):

 Col1     Col2 
 NA       There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

As with the result for 'y' above, the output for 'z' should be:

  Col1    Col2
  NA      all of the text from 'z'

What I tried:

resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

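I then combine the results above into a single data frame, "resDF", using rbind.

The problems are:

  • The above could presumably be done with a for() loop, or with some other approach that makes the code simpler.
  • The results for the "y" and "z" texts are not what I want (as shown above).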

4 answers:

Answer 0 (score: 3)

You can try this regex, which uses a negative lookahead:

^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$

Regex Demo and Detailed explanation of the regex

  

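Update:

If your condition is met, the regex matches and you get two parts:

Group 1 contains the first value; group 2 contains the second value.

If the condition is not met, copy the whole string into column 2 and put whatever you want in column 1.

The updated example snippet contains a function named processData that does this for you. If the condition is met, it splits the data and puts the parts into col1 and col2. If the condition is not met, as for inputs y and z, it puts NA in col1 and the whole value in col2.

Running example source -> ideone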

library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)

   processData <- function(a) {
        patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"    
        if(grepl(patt,a,perl=TRUE))
        {
            result<-str_match(a,patt)    
            col1<-result[2]
            col2<-result[3]
        }
        else
        {
            col1<-"NA"
            col2<-a
        }
       return(c(col1,col2))

    }



for (i in 1:nrow(df)){
tmp <- df[i, ]
resDF[nrow(resDF) + 1, ] <- processData(tmp)
}    


print(resDF)

Sample output:

                                                   Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please

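Answer 1 (score: 3)

I was inspired by Rizwan's answer, so you will see that my answer builds on his. What I did not like about it is that it breaks on periods that do not end a sentence (e.g. row.names, although the OP's sample text does not contain a case where row.names occurs 3 times within the first 2 sentences to demonstrate this). I also made sure that the capture groups/columns are numbered exactly as the OP expects and that the pattern always matches. My answer is really an improvement on Rizwan's.

Note 1: I assume that a "sentence" is defined by a period/dot followed by at least one horizontal whitespace character.

Note 2: This works with PCRE regex and has not been tested with other regex flavors; it may need adapting to work in them (specifically the if/else construct and the vertical and horizontal whitespace tokens).

Code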

See this code in use here

^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$

Results

Input

There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Output

Match 1

  • Group 1: There is a horror movie running in the iNox theater.
  • Group 2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 2

  • Group 1: empty (did not participate in the match)
  • Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 3

  • Group 1: empty (did not participate in the match)
  • Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

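Explanation

  • ^ asserts the position at the start of the string
  • (?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)
    • (?(?!...)x|y) is an if/else construct that uses the negative lookahead (?!...) as its condition
      • (?:[^:\v]*?\.\h){3,} matches the following at least 3 times
      • [^:\v]*? matches any character not in the set (neither a colon nor a vertical whitespace character) any number of times, but as few as possible
      • \.\h matches a literal dot followed by a horizontal whitespace character (such as a space or a tab)
      • If the condition is true (the text does not start with three or more such sentences before the first colon), do the following
      • ([^:\v]*?)\s*:\s*
        • ([^:\v]*?) captures into group 1: any character not in the set (neither a colon nor a vertical whitespace character) any number of times, but as few as possible
        • \s*:\s* matches any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note that if the colon always has at least one whitespace character before and after it, you can change * to +, which improves matching in cases where a "sentence" may itself contain a : character)
      • If the condition is false (the condition above is not met), match nothing
  • (.*) captures into group 2: any character (excluding newlines while the s flag is off) any number of times
  • $ asserts the position at the end of the string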
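
This answer gives only the pattern, so here is a minimal sketch of how it might be applied from R. It assumes base R's regexec() with perl = TRUE (which uses PCRE). The helper name split_on_first_colon, the whitespace-collapsing step, and the toy input strings are illustrative assumptions, not part of the original answer. Line breaks are collapsed first because the pattern excludes vertical whitespace and (.*) does not cross newlines.

```r
# PCRE conditional pattern from this answer; each backslash is doubled for R string syntax.
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical helper: returns a data frame with columns Col1 and Col2.
split_on_first_colon <- function(text) {
  # Assumption: line breaks inside a text are not meaningful, so collapse them to one space.
  flat <- gsub("\\s*\\v+\\s*", " ", text, perl = TRUE)
  m <- regmatches(flat, regexec(patt, flat, perl = TRUE))
  rows <- lapply(seq_along(m), function(i) {
    g <- m[[i]]                      # full match, group 1, group 2
    if (length(g) < 3L) {
      # No match at all (e.g. a short text with no colon): keep everything in Col2.
      return(c(NA_character_, flat[i]))
    }
    # An empty group 1 means the "else" branch was taken, so Col1 is NA.
    c(if (nzchar(g[2L])) g[2L] else NA_character_, g[3L])
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("Col1", "Col2")
  out
}

# Toy inputs (hypothetical): the colon follows 1 sentence in the first, 4 in the second.
toy <- c("One. : Two. Three. Four : end",
         "One. Two. Three. Four. : Five : end")
print(split_on_first_colon(toy))
```

With the toy inputs, only the first string should be split at its first colon; the second should end up entirely in Col2 with NA in Col1. Calling split_on_first_colon(df$Text) would apply the same logic to the question's data.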

Answer 2 (score: 2)

Negative lookaheads are expensive and hard to read. Here is a simpler solution:

library(stringr)

# throw out everything after first :, and count the number of sentences
split = str_count(sub(':.*', '', df$Text), fixed('. ')) < 3

# assemble the required data (you could also avoid ifelse if really needed)
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
           col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))
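
To make the intermediate values visible, the same count-then-split idea can be sketched in base R on toy strings. This is an illustration under stated assumptions rather than the answer's own code: str_count() is swapped for gregexpr() with fixed = TRUE, trimws() is added to drop the spaces left around the colon, and the names texts, head_part, n_sent and split_ok are hypothetical.

```r
# Toy inputs (hypothetical): the first colon follows 1 sentence in the first string, 4 in the second.
texts <- c("One. : Two. Three : end",
           "One. Two. Three. Four. : Five.")

# Everything before the first colon.
head_part <- sub(":.*", "", texts)

# Count occurrences of ". " (period plus space) in that leading part.
n_sent <- lengths(regmatches(head_part, gregexpr(". ", head_part, fixed = TRUE)))

# Split only when fewer than 3 sentence ends precede the first colon.
split_ok <- n_sent < 3

res <- data.frame(
  col1 = ifelse(split_ok, trimws(head_part), NA),
  col2 = ifelse(split_ok, trimws(sub(".*?:", "", texts)), texts),
  stringsAsFactors = FALSE
)
print(n_sent)
print(res)
```

Note that counting ". " treats any period followed by a space as a sentence end, so a period inside a token such as row.names is not counted.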

Answer 3 (score: 1)

Split the text into sentences, grep for the first occurrence of :, and split the original text conditionally:

sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
  sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)

str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))

# 'data.frame': 1 obs. of  2 variables:
#   $ Col1: chr "There is a horror movie running in the iNox theater. "
#   $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__
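
As a small illustration of the sentence-splitting step used above, the sketch below applies the same strsplit() pattern to a toy string (a hypothetical input) and looks up which piece holds the first colon. Because the split happens right after each period, the colon ends up in the piece that follows the sentence it comes after.

```r
# Toy input (hypothetical): the first colon comes after the second sentence.
toy <- "One. Two. : Three. Four : end"

# Split at the zero-width position after a period and before whitespace plus a non-space character.
parts <- strsplit(toy, "(?<=\\.)(?=\\s+\\S)", perl = TRUE)[[1L]]

# Index of the first piece that contains a colon.
first_colon <- grep(":", parts)[1L]

print(parts)
print(first_colon)
```

The pieces keep their leading whitespace, so pasting them back together reproduces the original string.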

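Write a convenience function to make your life easier:

For example, you can use different punctuation to mark the end of a sentence (e.g. end_of_sentence = '.!?)' splits the text into sentences wherever one of .!?) is followed by whitespace); n lets you control within how many sentences to look for the first :; and you can change sep if you expect $ to occur in your text (choose a character that does not occur in your text).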

f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
  p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)

  sp <- strsplit(text, p, perl = TRUE)[[1L]]
  sp <- if (grep(':', sp)[1L] <= n)
    sub(':\\s+', sep, text) else paste0(sep, text)
  sp <- trimws(gsub('\\v', '', sp, perl = TRUE))

  read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
             stringsAsFactors = FALSE)
}

## test
f(x); f(y); f(z)

## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)

do.call('rbind', f(df$Text))

#   Col1
# 1 There is a horror movie running in the iNox theater. 
# 2                                                  <NA>
# 3                                                  <NA>
#   Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please