Question

Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

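Notice that there is a ":" in a different place in each text. For example:

In 'x' the ":" comes after the first sentence.
In 'y' the ":" comes after the fourth sentence.
And in 'z' it comes after the sixth sentence.
In addition, there is a ":" before the last sentence of each text.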

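What I want to do is create two columns:

Only the first ":" is considered, not the last one.
If there is a ":" within the first three sentences, split the whole text into two columns; otherwise, keep all of the text in the second column and put 'NA' in the first column.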

Wanted output for 'x':

 Col1                                                        Col2 
 There is a horror movie running in the iNox theater.        If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Wanted output for 'y' (because the ":" is not found within the first three sentences):

 Col1     Col2 
 NA       There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Just like the result for 'y' above, the result for 'z' should be:

  Col1    Col2
  NA      all of the text from 'z'

What I tried is:

resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]), 
           Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

and then combining the above into one result data frame, "resDF", using rbind.

The problems are:

Can the above be done with a "for() loop" or any other method that makes the code simpler?
The results for the "y" and "z" texts are not what I want (as shown above).

Answer 1

You can try this negative lookahead regex:

^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$

Regex Demo and Detailed explanation of the regex

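Update:

If your condition is satisfied, the regex matches and you get two parts:

Group 1 contains the first value (the text before the first ":") and group 2 contains the rest.

If the condition is not satisfied, copy the whole string into column 2 and put whatever you want into column 1.

The updated sample snippet below contains a function named processData that does the trick. If the condition is satisfied, it splits the data and puts the parts into col1 and col2. If the condition is not satisfied, as with inputs y and z, it puts NA in col1 and the whole value in col2.

Running sample source -> ideone: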

library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = F)

resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)

   processData <- function(a) {
        patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"    
        if(grepl(patt,a,perl=TRUE))
        {
            result<-str_match(a,patt)    
            col1<-result[2]
            col2<-result[3]
        }
        else
        {
            col1<-"NA"
            col2<-a
        }
       return(c(col1,col2))

    }



for (i in 1:nrow(df)){
tmp <- df[i, ]
resDF[nrow(resDF) + 1, ] <- processData(tmp)
}    


print(resDF)

Sample output:

                                                   Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
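
As a loop-free variant of the snippet above, here is a minimal sketch that applies the same pattern with base R only. The helper name `split_first_colon` is hypothetical, and two choices are assumptions rather than part of the answer: it returns a real `NA` instead of the string "NA", and it trims surrounding whitespace from both parts. It relies on `regexec(..., perl = TRUE)`, which needs R 3.3.0 or later.

```r
# Same pattern as in the answer above (PCRE syntax, backslashes doubled for R).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical helper: takes a character vector, returns a two-column data frame.
split_first_colon <- function(text) {
  # One element per input string: c(full match, group 1, group 2),
  # or character(0) when the pattern does not match.
  m <- regmatches(text, regexec(patt, text, perl = TRUE))
  rows <- lapply(seq_along(text), function(i) {
    if (length(m[[i]]) == 3L) {
      # Assumption: surrounding whitespace is not wanted in either column.
      c(trimws(m[[i]][2L]), trimws(m[[i]][3L]))
    } else {
      # No ":" early enough: real NA in Col1, full text in Col2.
      c(NA_character_, text[i])
    }
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("Col1", "Col2")
  out
}

# Short made-up inputs, used only to keep the example readable.
sample_text <- c("One. : Two. Three. Four : end",
                 "One. Two. Three. Four. : Five : end")
res <- split_first_colon(sample_text)
print(res)
```

With the question's data this would be called as `split_first_colon(df$Text)`. Like the original pattern, it counts every "." as a sentence end, including the one in `row.names`.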

Answer 2

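Brief

I was inspired by Rizwan's answer to write my own, so you'll see that his answer complements mine. What I didn't like about it is that it breaks on dots that do not end a sentence (e.g. row.names, although the OP's sample text does not include a case where row.names appears 3 times within the first 2 sentences to showcase this). I also made sure that the capture groups/columns are numbered exactly as the OP expects and are always matched. My answer is really an improvement on Rizwan's.

Note 1: I assume a "sentence" is defined by a period/dot followed by at least one horizontal whitespace character.

Note 2: This works with PCRE regex and has not been tested with other regex flavors; it may need to be adapted to work in other flavors (i.e. the if/else, vertical whitespace and horizontal whitespace tokens).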

Code

See this code in use here

^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$

Results

Input

There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Output

Match 1

Group 1: There is a horror movie running in the iNox theater.
Group 2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 2

Group 1: empty - no match
Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Match 3

Group 1: empty - no match
Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

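Explanation

^ asserts position at the start of the string
(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)
- (?(?!...)x|y) is an if statement that uses a negative lookahead (?!...) as its condition
  - (?:[^:\v]*?\.\h){3,} matches the following at least 3 times
  - [^:\v]*? matches any character not in the set (not a colon or vertical whitespace character) any number of times, but as few as possible
  - \.\h matches a literal dot character followed by a horizontal whitespace character (space or tab)
  - If statement true: the condition above is met, so do the following
  - ([^:\v]*?)\s*:\s*
    - ([^:\v]*?) captures into group 1: any character not in the set (not a colon or vertical whitespace character) any number of times, but as few as possible
    - \s*:\s* matches any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note that if there is always at least 1 whitespace character you can change * to +, in case "sentences" may contain :)
  - If statement false: the condition above is not met, so do the following: match nothing
(.*) captures into group 2: any character (excluding newline characters when the s flag is off) any number of times
$ asserts position at the end of the string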
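
The pattern above is given in raw PCRE form only. As a minimal sketch of how it might be applied from R, the snippet below uses base R's `regexec(..., perl = TRUE)` (R 3.3.0 or later), since the conditional syntax needs the PCRE engine. The helper name `split_conditional` is hypothetical. It also collapses all whitespace runs, including line breaks, to a single space before matching; that step is an assumption added here because the pattern excludes vertical whitespace and the question's strings span several lines.

```r
# The answer's PCRE pattern, with backslashes doubled for an R string literal.
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical helper (not part of the original answer).
split_conditional <- function(text) {
  # Assumption: line breaks inside a text carry no meaning, so every
  # whitespace run is collapsed to one space before matching.
  flat <- gsub("\\s+", " ", text, perl = TRUE)
  m <- regmatches(flat, regexec(patt, flat, perl = TRUE))
  rows <- lapply(seq_along(flat), function(i) {
    g <- m[[i]]  # c(full match, group 1, group 2) when the pattern matches
    if (length(g) != 3L) return(c(NA_character_, flat[i]))  # no match at all
    # An unset or empty group 1 means "no early colon": report it as NA.
    col1 <- if (is.na(g[2L]) || !nzchar(g[2L])) NA_character_ else g[2L]
    c(col1, g[3L])
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("Col1", "Col2")
  out
}

# Short made-up inputs, used only to keep the example readable.
sample_text <- c("One. : Two. Three. Four : end",
                 "One. Two. Three. Four. : Five : end")
res <- split_conditional(sample_text)
print(res)
```

For the question's data the call would be `split_conditional(df$Text)`. The first sample string has its colon after one sentence, so it should be split; the second has three dot-plus-space sentence ends before the colon, so group 1 stays empty and Col1 becomes NA.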

Answer 3

Negative lookaheads are expensive and hard to read. Here is a simpler solution:

library(stringr)

# throw out everything after first :, and count the number of sentences
split = str_count(sub(':.*', '', df$Text), fixed('. ')) < 3

# assemble the required data (you could also avoid ifelse if really needed)
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
           col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))

Answer 4

Split into sentences, grep for the first occurrence of :, and use a condition to split the original text:

sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
  sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)

str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))

# 'data.frame': 1 obs. of  2 variables:
#   $ Col1: chr "There is a horror movie running in the iNox theater. "
#   $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__

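Write a convenience function to make your work easier:

For example, you can use different punctuation to mark the end of a sentence (e.g. end_of_sentence = '.!?)' will split the text into sentences wherever one of .!?) is followed by a space); n lets you control the number of sentences in which to look for the first :; and you can change sep if you expect $ to occur in your text (choose a character here that is unlikely to appear in your text).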

f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
  p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)

  sp <- strsplit(text, p, perl = TRUE)[[1L]]
  sp <- if (grep(':', sp)[1L] <= n)
    sub(':\\s+', sep, text) else paste0(sep, text)
  sp <- trimws(gsub('\\v', '', sp, perl = TRUE))

  read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
             stringsAsFactors = FALSE)
}

## test
f(x); f(y); f(z)

## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)

do.call('rbind', f(df$Text))

#   Col1
# 1 There is a horror movie running in the iNox theater. 
# 2                                                  <NA>
# 3                                                  <NA>
#   Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Error in regex pattern matching to split retrieved text into two columns of a data frame
