How can I split a string of text after a certain number of words and numbers in R?

Asked: 2016-02-05 08:30:27

Tags: r split strsplit stringr

I want to split my text 8 words and numbers after each time it contains.

Example text:

s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'

Example of how I would like the text to be split:

 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE
  random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE
  random random random'

I know I can find the time in several ways, for example:

str_extract(str_extract(s, "[:digit:]*:"), "[:digit:]*")

But I'm not sure how to split eight words and numbers after the time. Any help would be greatly appreciated.

3 answers:

Answer 0: (score: 5)

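We can match a colon followed by two digits, then eight repetitions of one or more whitespace characters (\\s+) followed by one or more non-whitespace characters (\\S+), replace the whitespace character that comes next with a comma, and then split on that delimiter. This assumes the text itself contains no commas.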

strsplit(gsub('((?:\\:\\d{2}(\\s+\\S+){8}))\\s', '\\1,', 
            s, perl=TRUE), ',')[[1]]
#[1] "random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE"
#[2] "random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE"         
#[3] "random random random"      

Data

s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'
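
If the text may contain commas, or the number of words needs to vary, the same idea can be wrapped in a small function. This is only a sketch: `split_after_time`, its `n` and `sep` arguments, and the use of a newline as the temporary delimiter are assumptions made for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the original answer): split `s` after the
# n whitespace-separated tokens that follow each ":dd" time fragment.
split_after_time <- function(s, n = 8, sep = "\n") {
  # a colon and two digits, then n tokens, then the whitespace to replace
  pattern <- sprintf("(:\\d{2}(?:\\s+\\S+){%d})\\s+", n)
  # keep the captured text and put the temporary delimiter after it
  marked <- gsub(pattern, paste0("\\1", sep), s, perl = TRUE)
  # split on the delimiter, treated as a literal string
  strsplit(marked, sep, fixed = TRUE)[[1]]
}

s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ DOG LOG #72 FIRE random random random'
parts <- split_after_time(s)
print(parts)
```

The delimiter only has to be a string that never occurs in the input, so `sep` can be changed if the text contains newlines.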

Answer 1: (score: 1)

Using a for loop to handle the different cases (I hope I've commented it enough; feel free to ask if anything is unclear):

s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'
as <- strsplit(s, " ")[[1]] # split the string on spaces to get the words
nwords <- length(as) # count them (reused below)
timepos <- c(grep('\\d+:\\d+', as), nwords) # positions of the times, plus nwords as a sentinel for the last line

start <- 1 # initialize the start position
lines <- vector('list', length(timepos)) # preallocate the list to avoid growing it in the loop

for (i in seq_along(timepos)) { # loop over the lines we need
  end <- timepos[i] + 8 # a line ends 8 words after the time
  if (end > nwords) end <- nwords # sanity check: never run past the last word

  lines[[i]] <- paste0(as[start:end], collapse = " ") # build the line

  start <- end + 1 # the next line starts right after this one
  if (start > nwords) break # stop once every word has been used
}
result <- unlist(lines) # unlist() also drops entries left empty by an early break

Output:

[1] "random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE"
[2] "random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random"      
[3] "random random"  
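
The loop can also be replaced by a vectorised grouping step. The following is a sketch of that alternative rather than the method used above: it labels each word with a line number via `findInterval()` and pastes each group together. The names `words`, `breaks` and `group` are illustrative, and the sketch assumes that no time appears within the 8 words following another time.

```r
s <- 'random random random 19:49 0-2 H 2 ABC TREE LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'

words <- strsplit(s, " ")[[1]]

# index of the last word of each line: 8 words after every time
breaks <- grep("\\d+:\\d+", words) + 8
breaks <- breaks[breaks < length(words)] # a break at or past the end is not needed

# words positioned after a break belong to the next group (0, 1, 2, ...)
group <- findInterval(seq_along(words), breaks + 1)

# paste the words of each group back together
result <- unname(vapply(split(words, group), paste, character(1), collapse = " "))
print(result)
```

With this input the three lines are the same as in the output above.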

Answer 2: (score: 0)

s = 'random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'

splitted = strsplit(s, ' ')[[1]]
# [1] "random" "random" "random" "19:49"  "0-2"    "H"      "2"      "ABC"    "19:49"  "LAKE"   "#88"   
# [12] "TURTLE" "random" "random" "03:32"  "43-21"  "V"      "8"      "XYZ"    "LOG"    "#72"    "FIRE"  
# [23] "random" "random" "random"


# find two digits + colon + two digits, `^` means begin of string, `$` means end of string
where_time = which( grepl('^\\d{2}:\\d{2}$', splitted) )
# 4     9    15 
where_to_break = where_time + 8
# 12    17    23 


# if time2 is between time1 and the break of time1, don't break for time2
# seq_along() avoids the 1:0 trap when there is only one time
for (ii in seq_along(where_time)){

        if(is.na(where_time[ii])){
                next
        }
        between = where_time[ii] < where_time & where_time < where_to_break[ii]
        where_time[between] = NA
}
where_time = where_time[!is.na(where_time)]
where_to_break = where_time + 8
# 12    23 


# if a planned break is after the end of text, it's unnecessary
where_to_break = where_to_break[ where_to_break < length(splitted) ]
# 12    23


# each line runs from just after the previous break (or word 1)
# up to the next break (or the last word), so no word is repeated
starts = c(1, where_to_break + 1)
ends = c(where_to_break, length(splitted))

s2 = vector('character', length(starts))
for (ii in seq_along(starts)){
        s2[ii] = paste(splitted[ starts[ii]:ends[ii] ], collapse = ' ')
}

# recombine lines
s3 = paste(s2, collapse = '\n')
cat(s3)
# random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE
# random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random
# random random
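
The rule "ignore a time that falls inside the 8 words following an earlier time" can also be written as a single forward scan. This is a sketch under assumptions: `break_positions` and its `n` argument are hypothetical names, and the strict `<` comparison mirrors the one used in the masking loop above, so a time that is itself the 8th word still starts a new count.

```r
# Hypothetical helper: return the index of the last word of each line.
break_positions <- function(tokens, n = 8) {
  times <- grep("^\\d{2}:\\d{2}$", tokens) # positions of the time tokens
  breaks <- numeric(0)
  last_break <- 0
  for (t in times) {
    if (t < last_break) next # time lies inside the current line: ignore it
    last_break <- t + n      # the line ends n words after this time
    breaks <- c(breaks, last_break)
  }
  breaks[breaks < length(tokens)] # a break at or past the end is unnecessary
}

s <- 'random random random 19:49 0-2 H 2 ABC 19:49 LAKE #88 TURTLE random random 03:32 43-21 V 8 XYZ LOG #72 FIRE random random random'
tokens <- strsplit(s, ' ')[[1]]
where_to_break <- break_positions(tokens)
print(where_to_break)
```

For this input the second 19:49 is skipped, and the scan yields the same break positions, 12 and 23, as the masking loop.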