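I need to parse the following character string into 3 records. Each record starts with a timestamp but may optionally span multiple subsequent lines, e.g.: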
x <- c('02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
some more
and more
02-May-2017 10:10:43: xyz')
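The basic approach would be to search for the first match of the timestamp pattern, remember its start position, search for the next timestamp starting from the end of the previous match, and extract the characters between the two.
Any ideas on whether there is an efficient way to achieve this?
By the way, the desired output is: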
[1] 02-May-2017 10:10:41: some description
[2] 02-May-2017 10:10:42: some description some more and more
[3] 02-May-2017 10:10:43: xyz
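The approach described in the question (locate every timestamp, remember where it starts, and extract the characters up to the next one) can be sketched in base R roughly as follows. This is only a minimal illustration, not taken from any of the answers: the variable names `starts`, `ends` and `records` are made up for the example, and it assumes the string contains at least one timestamp.

```r
x <- "02-May-2017 10:10:41: some description\n02-May-2017 10:10:42: some description\nsome more\nand more\n02-May-2017 10:10:43: xyz"

# timestamp pattern, including the trailing colon
pattern <- "[0-9]{2}-[A-Z][a-z]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}:"

# start position of every timestamp in the string
starts <- as.integer(gregexpr(pattern, x)[[1]])

# each record ends right before the next timestamp; the last one ends with the string
ends <- c(starts[-1] - 1L, nchar(x))

# extract the records, then collapse line breaks into single spaces
records <- substring(x, starts, ends)
records <- trimws(gsub("\\s+", " ", records))
records
```

Because `gregexpr()` finds all matches in one pass, no explicit loop over "search again from the end of the previous match" is needed.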
Answer 0 (score: 3)
Here is my attempt:
# read in line
x <- c('02-May-2017 10:10:41: some description
02-May-2017 10:10:42: some description
some more
and more
02-May-2017 10:10:43: xyz')
# remove line breaks
x <- gsub("\n", " ", x)
# regex pattern for timestamp
pattern <- "[0-9]{2}-[A-Z][a-z]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
# split lines without time stamps
x.lines <- strsplit(x, pattern)[[1]][-1]
# corresponding time stamps
x.stamps <- stringr::str_extract_all(x, pattern)[[1]]
# recombine each time stamp with its text (%>% comes from magrittr)
library(magrittr)
lapply(seq_along(x.stamps), function(i) {paste0(x.stamps[i], x.lines[i])}) %>% unlist()
[1] "02-May-2017 10:10:41: some description "
[2] "02-May-2017 10:10:42: some description some more and more "
[3] "02-May-2017 10:10:43: xyz"
Answer 1 (score: 2)
This works:
res = strsplit(x, "\\s+(?=\\d{2}\\-)", perl=TRUE)[[1]]
[1] "02-May-2017 10:10:41: some description"
[2] "02-May-2017 10:10:42: some description\nsome more\nand more"
[3] "02-May-2017 10:10:43: xyz"
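If needed, you can replace the \n afterwards using gsub("\\n\\s*", " ", res) or similar.
This relies on none of the "some more" continuation lines starting with 01- or similar. If they do, the ?= lookahead pattern can be extended to be more specific.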
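As a rough sketch of what a more specific lookahead could look like, the example below requires the full timestamp after the whitespace instead of just two digits and a hyphen. The input is a modified copy of the question's string in which one continuation line deliberately starts with `01-`; the name `stamp` is only an illustrative helper.

```r
# continuation line "01-some more" would fool the short lookahead \\d{2}\\-
x <- "02-May-2017 10:10:41: some description\n02-May-2017 10:10:42: some description\n01-some more\nand more\n02-May-2017 10:10:43: xyz"

# full timestamp, used only inside the lookahead so it is not consumed by the split
stamp <- "\\d{2}-[A-Z][a-z]{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}:"

# split on whitespace that is directly followed by a complete timestamp
res <- strsplit(x, paste0("\\s+(?=", stamp, ")"), perl = TRUE)[[1]]

# collapse the remaining line breaks inside each record
res <- gsub("\\s*\\n\\s*", " ", res)
res
```

With the shorter lookahead the `01-some more` line would start a record of its own; with the full timestamp it stays attached to the second record.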
Answer 2 (score: 1)
library(stringr)
y <- str_replace_all(x, "\\n", " ") # replace line breaks with spaces so words do not run together
words <- unlist(str_split(y, "\\d+-\\D+-\\d+\\s+\\d+:\\d+:\\d+:")) # split the string at date-like strings
words <- words[words!=""] # remove the empty string before the first time stamp
timestamps <- unlist(str_extract_all(x, "\\d+-\\D+-\\d+\\s+\\d+:\\d+:\\d+:")) # extract the date-like strings
str_trim(paste0(timestamps, words)) # recombine and drop trailing whitespace
[1] "02-May-2017 10:10:41: some description"
[2] "02-May-2017 10:10:42: some description some more and more"
[3] "02-May-2017 10:10:43: xyz"
\\d+ = digit(s)
- = - (literal hyphen)
\\D+ = non-digit(s)
\\s+ = white-space(s)
: = : (literal colon)
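To illustrate how these pieces behave together, the sketch below compares the pattern above with a stricter variant. The stricter pattern and the second test string are assumptions made up for this example, and base R's `grepl()` with `perl = TRUE` is used here instead of stringr.

```r
# pattern from the answer: any digits, any non-digits, any digits, ...
loose <- "\\d+-\\D+-\\d+\\s+\\d+:\\d+:\\d+:"

# hypothetical stricter variant with fixed widths and a capitalised month abbreviation
strict <- "\\d{2}-[A-Z][a-z]{2}-\\d{4}\\s+\\d{2}:\\d{2}:\\d{2}:"

candidates <- c("02-May-2017 10:10:41:", "1-??-2 3:4:5:")

loose_hits  <- grepl(loose, candidates, perl = TRUE)   # matches both strings
strict_hits <- grepl(strict, candidates, perl = TRUE)  # matches only the real timestamp
```

The looser pattern is fine for clean log data; the stricter one is worth considering if the descriptions may contain date-like fragments.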