Consider the following hypothetical data:
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)
Notice that the first ":" appears at a different place in each string.
What I want to do is create two columns:
Desired output for 'x':
Col1: There is a horror movie running in the iNox theater.
Col2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Desired output for 'y' (because the ":" is not found within the first three sentences):
Col1: NA
Col2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
As with the result for 'y' above, the output for 'z' should be:
Col1: NA
Col2: all of the text from 'z'
What I have tried:
resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))
resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))
resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]),
Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))
and then combine the results above into a single data frame "resDF" using rbind.
The problems are:
Answer 0 (score: 3)
You can try this negative-lookahead regex:
^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$
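To see how the lookahead decides whether a string is split, here is a minimal sketch in base R. The three short inputs are hypothetical and are not taken from the question. Note that the pattern counts every period before the first ":", including periods inside a token such as row.names.

```r
# The pattern from the answer, written as an R string (backslashes doubled).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical short inputs, not taken from the question.
early  <- "One. Two : rest"     # one period before the first ":"
late   <- "A. B. C. D : rest"   # three periods before the first ":"
dotted <- "a.b.c.d : rest"      # three periods, none of them ending a sentence

# perl = TRUE is needed for the lookahead and the inline (?s) flag.
matches <- grepl(patt, c(early, late, dotted), perl = TRUE)
print(matches)
```

Only the first input matches; the other two are rejected because three periods occur before the first ":".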
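Update:
If your condition is met, the regex matches and you get two parts: group 1 contains the text before the first ":" and group 2 contains the text after it.
If the condition is not met, copy the whole string into column 2 and put whatever you like into column 1.
The updated sample snippet below contains a function named processData that does this for you. If the condition is met, it splits the data and puts the parts into Col1 and Col2. If the condition is not met, as with y and z in your input, it puts NA in Col1 and the whole value in Col2.
Sample source (runnable on ideone):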
library(stringr)
x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"
df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)
resDF <- data.frame("Col1" = character(), "Col2" = character(), stringsAsFactors=FALSE)
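processData <- function(a) {
  # Match only when fewer than three periods precede the first ":".
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  if (grepl(patt, a, perl = TRUE)) {
    result <- str_match(a, patt)
    col1 <- result[2]  # text before the first ":"
    col2 <- result[3]  # text after the first ":"
  } else {
    # This is the literal string "NA"; use NA_character_ for a true missing value.
    col1 <- "NA"
    col2 <- a
  }
  c(col1, col2)
}

# Append one result row per input row.
for (i in seq_len(nrow(df))) {
  tmp <- df[i, ]
  resDF[nrow(resDF) + 1, ] <- processData(tmp)
}
print(resDF)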
Sample output:
Col1
1 There is a horror movie running in the iNox theater.
2 NA
3 NA
Col2
1 If row names are supplied of length one and the data \n frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : \n If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). \n If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify \n the row names and not a column (by name or number) Can we go : Please
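The same split can also be applied to a whole character vector at once instead of growing the data frame row by row. The following is a sketch under a few assumptions: split_text is a hypothetical helper name, the inputs are made-up short strings, and trimws is added to drop the whitespace around the ":".

```r
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical helper: returns a two-column data frame for a character vector.
split_text <- function(text) {
  ok <- grepl(patt, text, perl = TRUE)
  data.frame(
    # group 1 where the pattern matches, a real NA otherwise
    Col1 = trimws(ifelse(ok, sub(patt, "\\1", text, perl = TRUE), NA_character_)),
    # group 2 where the pattern matches, the untouched input otherwise
    Col2 = trimws(ifelse(ok, sub(patt, "\\2", text, perl = TRUE), text)),
    stringsAsFactors = FALSE
  )
}

# Made-up inputs: the first has its ":" early, the second after three periods.
res <- split_text(c("One. Two : rest", "A. B. C. D : rest"))
print(res)
```

Because Col1 holds NA_character_ rather than the string "NA", missing values print as <NA> and can be detected with is.na.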
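Answer 1 (score: 3)
I was inspired by Rizwan's answer, so you will see that his answer complements mine. What I did not like about it is that it breaks on periods that do not start a new sentence (e.g. row.names, although the sample text the OP provided does not contain row.names three times within the first two sentences to showcase this). I also made sure that the capture groups/columns are numbered exactly as the OP expects and that the regex always matches. My answer is effectively an improvement on Rizwan's.
Note 1: I am assuming that a "sentence" is defined by a period/dot followed by at least one horizontal whitespace character.
Note 2: This works with PCRE regex and has not been tested with other regex flavors; it may need adapting to work with them (i.e. the if/else, vertical whitespace, and horizontal whitespace tokens).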
^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$
Input (the three strings from the question, with line breaks removed):
There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Match 1 (group 1 on the first line, group 2 on the second)
There is a horror movie running in the iNox theater.
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Match 2 (group 1 is empty; group 2 is the whole string)
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
Match 3 (group 1 is empty; group 2 is the whole string)
There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
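To try this conditional pattern from R, a minimal sketch might look like the following. It assumes base R's sub with perl = TRUE (R's PCRE engine), doubled backslashes for the R string, and made-up single-line inputs. It also assumes that a group which did not take part in the match is replaced by an empty string, which is then turned into NA.

```r
# The conditional pattern from the answer, as an R string.
patt <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical single-line inputs (the pattern does not cross line breaks).
s <- c("One. Two : rest : more", "One. Two. Three. Four : rest")

col1 <- sub(patt, "\\1", s, perl = TRUE)  # group 1: text before the first ":"
col2 <- sub(patt, "\\2", s, perl = TRUE)  # group 2: the remaining text
col1[col1 == ""] <- NA                    # empty group 1 means "no early colon"

res <- data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
print(res)
```

Strings that contain line breaks, like the original x, y and z, would need those removed first, since neither [^:\v] nor . (without the s flag) crosses a line break here.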
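Explanation:
- ^ Assert position at the start of the string
- (?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)
  - (?(?!...)x|y) If/else statement that uses the negative lookahead (?!...) as its condition
  - (?:[^:\v]*?\.\h){3,} Match the following at least 3 times
    - [^:\v]*? Match any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
    - \.\h Match a literal dot followed by a horizontal whitespace character (space or tab)
  - ([^:\v]*?)\s*:\s*
    - ([^:\v]*?) Capture into group 1: any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
    - \s*:\s* Match any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note that you can change * to + if there is always at least one whitespace character, for cases where a "sentence" may contain :)
- (.*) Capture into group 2: any character (excluding newlines when the s flag is off) any number of times
- $ Assert position at the end of the string
Answer 2 (score: 2)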
Negative lookaheads are expensive and hard to read. Here is a simpler solution:
library(stringr)
# Drop everything after the first ":" and count the sentence ends before it.
# The regex '\\.\\s' also counts a period followed by a line break, which a
# fixed '. ' would miss (the second sentence of z ends at a line break).
split <- str_count(sub(':.*', '', df$Text), '\\.\\s') < 3
# Assemble the required data (you could also avoid ifelse if really needed).
data.frame(col1 = ifelse(split, sub(':.*', '', df$Text), NA),
           col2 = ifelse(split, sub('.*?:', '', df$Text), df$Text))
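The sentence count in this approach depends on what is treated as the end of a sentence. The sketch below uses base R only and a made-up input with a line break right after a period. It compares counting the fixed string ". " with counting a period followed by any whitespace.

```r
# Hypothetical input: the second sentence ends at a line break.
s <- "One. Two.\nThree. : rest"

# Keep only the text before the first ":".
before <- sub(":.*", "", s)

# Count literal ". " occurrences: the period before "\n" is not counted.
n_fixed <- lengths(regmatches(before, gregexpr(". ", before, fixed = TRUE)))

# Count a period followed by any whitespace character, including "\n".
n_regex <- lengths(regmatches(before, gregexpr("\\.\\s", before, perl = TRUE)))

print(c(fixed = n_fixed, regex = n_regex))
```

With the fixed string the input appears to have two sentences before the ":", while the whitespace-based count gives three, which changes whether the text is split.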
Answer 3 (score: 1)
Split into sentences, grep for the first occurrence of ":", and split the original text conditionally:
sp <- strsplit(x, '(?<=\\.)(?=\\s+\\S)', perl = TRUE)[[1L]]
sp <- if (grep(':', sp)[1L] < 3L)
sub(':\\s+', '$', x) else paste0('$', x)
sp <- gsub('\\v', '', sp, perl = TRUE)
str(read.table(text = sp, sep = '$', col.names = paste0('Col', 1:2), as.is = TRUE))
# 'data.frame': 1 obs. of 2 variables:
# $ Col1: chr "There is a horror movie running in the iNox theater. "
# $ Col2: chr "If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names an"| __truncated__
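To make the splitting step easier to follow, here is a small sketch on a made-up string. It shows the pieces that strsplit produces with the lookbehind/lookahead pattern and how the index of the first piece containing ":" is found. The is.na guard is an addition that is not in the answer: for a string without any ":", grep returns an empty result and the comparison would otherwise fail.

```r
# Hypothetical input with the first ":" in the second sentence.
s <- "One. Two : rest. More"

# Split after each period that is followed by whitespace and more text;
# the pattern is zero-width, so no characters are removed.
sp <- strsplit(s, "(?<=\\.)(?=\\s+\\S)", perl = TRUE)[[1L]]
print(sp)

# Index of the first sentence that contains ":" (NA if there is none).
first <- grep(":", sp)[1L]
has_early_colon <- !is.na(first) && first < 3L
print(has_early_colon)
```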
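Write a convenience function to make your life easier:
For example, you can use different punctuation to mark the end of a sentence (e.g. end_of_sentence = '.!?)' splits the text into sentences wherever one of .!?) is followed by whitespace); n lets you control how many sentences are searched for the first ":"; and you can change sep if you expect $ to occur in your text (choose a character that is unlikely to occur in your text).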
f <- function(text, end_of_sentence = '.', n = 3L, sep = '$') {
  p <- sprintf('(?<=[%s])(?=\\s+\\S)', end_of_sentence)
  sp <- strsplit(text, p, perl = TRUE)[[1L]]
  sp <- if (grep(':', sp)[1L] <= n)
    sub(':\\s+', sep, text) else paste0(sep, text)
  sp <- trimws(gsub('\\v', '', sp, perl = TRUE))
  read.table(text = sp, sep = sep, col.names = paste0('Col', 1:2),
             stringsAsFactors = FALSE)
}
## test
f(x); f(y); f(z)
## vectorize it to work on more than one string
f <- Vectorize(f, SIMPLIFY = FALSE, USE.NAMES = FALSE)
do.call('rbind', f(df$Text))
# Col1
# 1 There is a horror movie running in the iNox theater.
# 2 <NA>
# 3 <NA>
# Col2
# 1 If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please
# 3 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please