Joining a multi-line message into a single string

Time: 2017-11-15 09:37:43

Tags: r regex

I am trying to parse a log file that looks like this:

24/01/2016, 11:50:17 pm: ‎Line to skip
24/01/2016, 11:50:17 pm: ‎Line to skip
25/01/2016, 11:51:47 pm: User1: Message one is here
25/01/2016, 11:53:04 pm: User2: A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and 

can
[span]

Several lines.
24/01/2016, 11:51:47 pm: User3: My first message
27/10/2017, 12:54:03 am: ‎‪+44 ‬012 3456789 left
28/10/2017, 02:54:03 pm: User3: My second message!

A reproducible sample of the data:

rawData <- structure(list(V1 = c("24 01 2016, 11:50:17 pm: ‎Line to skip", 
        "24 01 2016, 11:50:17 pm: ‎Line to skip", "24 01 2016, 11:51:47 pm: User1: Message one is here", 
        "24 01 2016, 11:53:04 pm: User2: A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and ", 
        "can", "[span]", "Several lines.", "24 01 2016, 11:51:47 pm: User3: My first message", 
        "27 10 2017, 12:54:03 am: ‎‪+44 ‬012 3456789 left")), .Names = "V1", row.names = c(NA, 
        -9L), class = "data.frame")

Every message starts with a date. I cannot work out how to parse messages that span several lines (such as line 4).

This is what I have so far:

suppressMessages(library(lubridate))
suppressMessages(library(dplyr)) 
suppressMessages(library(plyr))
suppressMessages(library(tidyr))

parseR <- function(file='data/chat_log.txt',drop="44"){
  rawData <- read.delim(file, quote = "", 
                  row.names = NULL, 
                  stringsAsFactors = FALSE,
                  header = F)


  # remove blank lines
  # rawData<-rawData[!apply(rawData == "", 1, all),]

  empty_lines = grepl('^\\s*$', rawData)
  rawData = rawData[! empty_lines]

  # join multi line messages into single line
  # rawData$V1<-gsub("[\r\n]", " ", rawData$V2)

  sepData<-suppressWarnings(separate(rawData, V1, c("datetime", "sender", "message"), sep = ": ", extra = "merge"))

  sepData$message <- trimws(sepData$message)
  sepData$sender<-factor(sepData$sender)

  data <- sepData %>% 
    filter(!is.na(message)) %>%
    filter(!grepl(drop, sender)) %>%
    droplevels() 

  cleanData<-separate(data, datetime, c("date", "time"), sep = " ", remove =TRUE)
  cleanData$date<-ymd(cleanData$date)
  cleanData$time<-hms(cleanData$time)

  return(cleanData)
}

However, when I inspect the returned data frame, the multi-line messages have not been parsed correctly:

> head(parseR())
        date        time sender                                                                                                                                                                       message
1 2016-01-25 23H 51M 47S  User1                                                                                                                                                           Message one is here
2 2016-01-25  23H 53M 4S  User2 A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and
3 2016-01-24 23H 51M 47S  User3                                                                                                                                                              My first message
4 2017-10-28  14H 54M 3S  User3   

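Can anyone suggest a way to remove the blank lines and to join text that does not start with a date onto the preceding message, so that each message is a single string?

Desired format for line 4: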

25/01/2016, 11:53:04 pm: User2: A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and can [span] Several lines.

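2 answers:

Answer 0 (score: 2)

Here is a caveman's approach to the problem. I treat a leading timestamp as the unique marker for the start of a message. If an element does not start with one, it is pasted onto the previous message. The example below works on a vector, but it can easily be adapted to other classes such as matrices or data.frames.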

rd <- c("24 01 2016, 11:50:17 pm: Line to skip", 
        "24 01 2016, 11:50:17 pm: Line to skip", "24 01 2016, 11:51:47 pm: User1: Message one is here", 
        "24 01 2016, 11:53:04 pm: User2: A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and ", 
        "can", "[span]", "Several lines.", "24 01 2016, 11:51:47 pm: User3: My first message", 
        "27 10 2017, 12:54:03 am: ‪+44 ‬012 3456789 left")
rd

out <- rep(NA, length(rd))

gr <- 1
for (i in seq_along(rd)) {
  # does this element start with a timestamp?
  find.startline <- grepl("^\\d{2} \\d{2} \\d{4}, \\d{2}:\\d{2}:\\d{2} (am|pm):", rd[i])
  if (find.startline) {
    # new message: store it in the next free slot of out and advance gr
    out[gr] <- rd[i]
    gr <- gr + 1
  } else {
    # continuation line: append it to the most recently stored message (ss)
    ss <- gr - 1
    out[ss] <- paste(out[ss], rd[i])
  }
}

# merging multi-line messages leaves unused NA slots at the end of out; drop them
out <- out[!is.na(out)]
out

[1] "24 01 2016, 11:50:17 pm: Line to skip"                                                                                                                                                                                                   
[2] "24 01 2016, 11:50:17 pm: Line to skip"                                                                                                                                                                                                   
[3] "24 01 2016, 11:51:47 pm: User1: Message one is here"                                                                                                                                                                                     
[4] "24 01 2016, 11:53:04 pm: User2: A long message that spans multiple lines, so I have to write a really long and tedious message here to illustrate my point. The point is that this message is really long and  can [span] Several lines."
[5] "24 01 2016, 11:51:47 pm: User3: My first message"                                                                                                                                                                                        
[6] "27 10 2017, 12:54:03 am: *+44 ,012 3456789 left" 
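
The same idea can also be written without an explicit loop. The following is only a sketch, not part of the original answer: `join_multiline`, `start_pattern` and `demo` are made-up names, and the pattern is assumed to accept both date spellings that appear in the question ("24/01/2016" and "24 01 2016"). It also drops blank lines first, which the question asked for, and trims each piece so the joined message has single spaces.

```r
# Sketch only: join_multiline and start_pattern are illustrative names.
join_multiline <- function(lines,
                           start_pattern = "^\\d{2}[ /]\\d{2}[ /]\\d{4}, \\d{2}:\\d{2}:\\d{2} (am|pm):") {
  # drop empty or whitespace-only lines
  lines <- lines[!grepl("^\\s*$", lines)]
  # TRUE where a line opens a new message
  is_start <- grepl(start_pattern, lines)
  # running count of message starts = message id of every line
  id <- cumsum(is_start)
  # lines before the first timestamp (id == 0) have no message to attach to
  keep <- id > 0
  # trim each piece, then glue the pieces of one message with single spaces
  pieces <- split(trimws(lines[keep]), id[keep])
  unname(vapply(pieces, paste, character(1), collapse = " "))
}

# small made-up input modelled on the log in the question
demo <- c("25/01/2016, 11:53:04 pm: User2: this message is really long and ",
          "",
          "can",
          "[span]",
          "",
          "Several lines.",
          "24/01/2016, 11:51:47 pm: User3: My first message")

joined <- join_multiline(demo)
print(joined)
```

Because `cumsum()` numbers every line by the message it belongs to, the grouping step replaces the `gr`/`ss` bookkeeping of the loop.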

Answer 1 (score: 1)

My suggestion is similar to Roman's solution, but in the tidyverse world:

library(dplyr)

rawData %>%
  mutate(
    MgsNo = (!substr(V1, 1, 1) %>%  # take the first character
               as.numeric %>%       # convert to numeric: NA (with a warning) for non-digits
               is.na) %>%           # the leading ! negates is.na: TRUE where the line starts with a digit
      cumsum) %>%                   # running count is the message number, e.g. 1,1,1,0,0,1 -> 1,2,3,3,3,4
  group_by(MgsNo) %>%
  do(MgsBody = paste(.$V1, collapse = " ")) %>%  # join all lines of each message, separated by a space
  select(MgsBody) %>%
  pull                              # returns a list with one element per message; unlist() gives a vector
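
One caveat with the first-character test: a continuation line that happens to begin with a digit is counted as a new message. A possible variant, sketched here on the assumption that every message opens with the timestamp format shown in the question, matches the whole timestamp instead. The names `start_pattern`, `demo`, `MsgNo`, `MsgBody` and `merged` are illustrative, and `summarise()` is swapped in for `do()` so the result is a plain character vector.

```r
library(dplyr)

# Assumed message-start pattern; accepts "24/01/2016" as well as "24 01 2016"
start_pattern <- "^\\d{2}[ /]\\d{2}[ /]\\d{4}, \\d{2}:\\d{2}:\\d{2} (am|pm):"

# small made-up stand-in for rawData, with a continuation line starting with a digit
demo <- data.frame(
  V1 = c("24 01 2016, 11:51:47 pm: User1: Message one is here",
         "24 01 2016, 11:53:04 pm: User2: I counted",
         "",
         "3 times and then",
         "stopped"),
  stringsAsFactors = FALSE
)

merged <- demo %>%
  filter(!grepl("^\\s*$", V1)) %>%                     # drop blank lines
  mutate(MsgNo = cumsum(grepl(start_pattern, V1))) %>% # message number for every line
  filter(MsgNo > 0) %>%                                # ignore text before the first timestamp
  group_by(MsgNo) %>%
  summarise(MsgBody = paste(trimws(V1), collapse = " ")) %>%
  pull(MsgBody)

print(merged)
```

Here "3 times and then" stays attached to User2's message because it does not match the full timestamp.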