How to split a large string into lines based on timestamp pattern matching

Date: 2017-08-21 16:26:00

Tags: r

I need to parse the following string into 3 lines. Each line starts with a timestamp but may optionally span several subsequent lines:

e.g.

x <- c('02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
  some more
  and more
  02-May-2017 10:10:43: xyz')

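The basic approach would be to search for the first match of the timestamp pattern, remember its start position, search for the next timestamp starting from the end of the previous match, and extract the characters between the two.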

Any ideas on whether there is an efficient way to achieve this?

By the way, the desired output is:

[1] 02-May-2017 10:10:41: some description
[2] 02-May-2017 10:10:42: some description some more and more
[3] 02-May-2017 10:10:43: xyz
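
The position-based approach described in the question (find where each timestamp starts, then extract everything up to the next start) could be sketched in base R roughly as follows. This is only a minimal sketch: the timestamp pattern assumes the dd-Mon-yyyy hh:mm:ss format shown in the sample, and the names starts, ends and records are illustrative.

```r
# Input from the question: one string with embedded newlines
x <- "02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
  some more
  and more
  02-May-2017 10:10:43: xyz"

# Timestamp pattern (assumed format: dd-Mon-yyyy hh:mm:ss)
pattern <- "[0-9]{2}-[A-Za-z]{3}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Start position of every timestamp in the string
starts <- as.integer(gregexpr(pattern, x)[[1]])

# Each record runs from its own start to just before the next start
ends <- c(starts[-1] - 1L, nchar(x))
records <- substring(x, starts, ends)

# Collapse newlines and indentation into single spaces, then trim the ends
records <- trimws(gsub("\\s+", " ", records))
print(records)
```

Because gregexpr and substring are both vectorized, no explicit loop over matches is needed.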

3 Answers:

Answer 0 (score: 3)

Here is my attempt:

# read in line
x <- c('02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
some more
and more
02-May-2017 10:10:43: xyz')

# remove line breaks
x <- gsub("\n", " ", x)

# regex pattern for timestamp
pattern <- "[0-9]{2}-[A-Z][a-z]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# split lines without time stamps
x.lines <- strsplit(x, pattern)[[1]][-1]

# corresponding time stamps
x.stamps <- stringr::str_extract_all(x, pattern)[[1]]

# paste each time stamp back onto its text; paste0 is vectorized,
# so neither lapply nor the magrittr pipe (which was never loaded) is needed
paste0(x.stamps, x.lines)

[1] "02-May-2017 10:10:41: some description "                   
[2] "02-May-2017 10:10:42: some description some more and more "
[3] "02-May-2017 10:10:43: xyz"  

Answer 1 (score: 2)

This works:

res = strsplit(x, "\\s+(?=\\d{2}\\-)", perl=TRUE)[[1]]

[1] "02-May-2017 10:10:41: some description"                         
[2] "02-May-2017 10:10:42: some description\n  some more\n  and more"
[3] "02-May-2017 10:10:43: xyz"  

If needed, you can then remove the \n with gsub, e.g. gsub("\\n ", "", res) or similar.

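This relies on the continuation ("some more") lines not starting with 01- or something similar. If they do, the lookahead (?=...) pattern can be extended to be more specific.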
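
One way to make the lookahead more specific is to require a complete timestamp after the whitespace instead of just two digits and a hyphen. The sketch below assumes the dd-Mon-yyyy hh:mm:ss: format from the question; the input x2 is a made-up variant in which a continuation line starts with "01-", and the names stamp, strict and loose are illustrative.

```r
# Hypothetical input: one continuation line starts with "01-", which the
# short lookahead would mistake for the start of a timestamp
x2 <- "02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
  01- first step
  and more
  02-May-2017 10:10:43: xyz"

# Full timestamp (assumed format dd-Mon-yyyy hh:mm:ss followed by a colon)
stamp <- "\\d{2}-[A-Za-z]{3}-\\d{4} \\d{2}:\\d{2}:\\d{2}:"

# Split on whitespace only when a complete timestamp follows
strict <- strsplit(x2, paste0("\\s+(?=", stamp, ")"), perl = TRUE)[[1]]

# For comparison: the short lookahead also splits before "01-"
loose <- strsplit(x2, "\\s+(?=\\d{2}-)", perl = TRUE)[[1]]

# Collapse newlines and indentation inside each record
strict <- gsub("\\s+", " ", strict)
print(strict)
print(length(loose))
```

With the stricter lookahead the "01- first step" line stays attached to the second record, whereas the short lookahead produces an extra, spurious split.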

Answer 2 (score: 1)

A stringr solution using regular expressions

library(stringr)

y <- str_replace_all(x, "\\n", "")                                  # remove newlines
words <- unlist(str_split(y, "\\d+-\\D+-\\d+\\s+\\d+:\\d+:\\d+:"))  # uses regex to split string at date-like strings
words <- words[words!=""]                                           # remove empty string == ""
timestamps <- unlist(str_extract_all(x, "\\d+-\\D+-\\d+\\s+\\d+:\\d+:\\d+:"))      # extracts date-like strings

paste0(timestamps, words)

Output

[1] "02-May-2017 10:10:41: some description"                       
[2] "02-May-2017 10:10:42: some description  some more  and more  "
[3] "02-May-2017 10:10:43: xyz"

Regex explanation

\\d+ = digit(s)
- = -
\\D+ = non-digit(s)
\\s+ = white-space(s)
: = :
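
To see which part of a timestamp each piece of the pattern matches, the same pattern can be wrapped in capture groups and applied to a single timestamp. This is a small illustrative sketch in base R; the parentheses are an addition for demonstration and are not part of the answer's pattern.

```r
# One timestamp in the format used in the question
stamp <- "02-May-2017 10:10:41:"

# Same building blocks as above, with capture groups added for illustration
pattern <- "(\\d+)-(\\D+)-(\\d+)\\s+(\\d+):(\\d+):(\\d+):"

# regexec returns the full match followed by one entry per capture group
parts <- regmatches(stamp, regexec(pattern, stamp))[[1]]
print(parts)
```

The first element is the whole match; the remaining six are the day, month, year, hour, minute and second. Note that \\D+ accepts any run of non-digits, so a stricter class such as [A-Za-z]{3} may be preferable if the month is always a three-letter abbreviation.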