Merge (collapse) some, but not all, elements of a string vector in R

Time: 2018-04-21 12:44:52

Tags: r string nlp stringi

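I have a rather peculiar problem that I can't work out. I have a vector of strings in which each element is one sentence of a novel.

All I need to do is collapse the lines that belong to the same dialogue. For example, take the following lines: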

snap <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
      "He didn't seem at all upset at being almost knocked to the ground.",
      "On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today!",
      "Rejoice, for You-Know-Who has gone at last!",
      "Even Muggles like yourself should be celebrating, this happy, happy day!\"",
      "And the old man hugged Mr Dursley around the middle and walked off."
      )

Lines 3 to 5 belong to the same dialogue, so they must be collapsed, giving the following vector:

snap.2 <- c("It was a few seconds before Mr Dursley realised that the man was wearing a violet cloak.",
      "He didn't seem at all upset at being almost knocked to the ground.",
      "On the contrary, his face split into a wide smile and he said in a squeaky voice that made passers-by stare: \"Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!\"",
      "And the old man hugged Mr Dursley around the middle and walked off."
      )

I can detect the lines with unbalanced double quotes with:

library(stringr)
which((str_count(snap, "\"") %% 2) != 0)
#[1] 3 5

But then I don't know how to merge lines 3, 4 and 5 as in the example above.

Any ideas on how to do this?

2 answers:

Answer 0: (score: 1)

We can paste the elements together into a single string and then split it again with a regular expression:

out <- strsplit(paste(snap, collapse=' '), '(?<=\\.)\\s*|(?<=["])\\s', perl = TRUE)[[1]]
identical(out, snap.2)
#[1] TRUE

Note: the general pattern for where to split is not clear from the example.

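Answer 1: (score: 1)

It may not be the best way (the code is very ugly), but it works. Basically:

  1. Split the output of which into pairs (each pair represents the start and end offsets of a dialogue)
  2. Use lead and lag from dplyr to find where the previous and the next dialogue sit
  3. Fill the gaps with dummy pairs where dialogue.start = dialogue.end for the non-dialogue lines
  4. Use the resulting data set as the index for paste

In code: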

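    library(stringr)  # str_count
    library(dplyr)    # %>%, mutate, row_number, lead, lag, filter, select
    
    # Lines with an odd number of double quotes open or close a dialogue
    dialogue.start <- which((str_count(snap, "\"") %% 2) != 0)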
    
    quotes.fill <- data.frame(dialogue.start) %>%
      mutate(n = row_number())
    
    quotes.fill$dialogue.end <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 1), NA)
    # Default is one past the last line, so narration after the final dialogue is kept
    quotes.fill$dialogue.next <- ifelse((quotes.fill$n %% 2) != 0, lead(quotes.fill$dialogue.start, 2, default = NROW(snap) + 1), NA)
    quotes.fill$dialogue.before <- ifelse((quotes.fill$n %% 2) != 0, lag(quotes.fill$dialogue.start, 2, default = 0), NA)
    
    
    quotes.fill <- quotes.fill %>% filter(!is.na(dialogue.end)) %>%
      select(-n)
    
    quotes.gaps <- do.call(rbind, lapply(split(quotes.fill, seq(nrow(quotes.fill))), function(x) { 
    
      prologue <- NULL
    
      dialogue.hold <- seq(to = (x$dialogue.next - 1), from = (x$dialogue.end + 1))
      dialogue.prologue <- seq(to = (x$dialogue.start - 1), from = (x$dialogue.before + 1))
    
      if(x$dialogue.before == 0 & x$dialogue.start > 0) prologue <- data.frame(dialogue.start = dialogue.prologue, dialogue.end = dialogue.prologue, stringsAsFactors = FALSE)
    
      if((x$dialogue.end + 1) >= x$dialogue.next) return(rbind(prologue, x[,c("dialogue.start", "dialogue.end")]))
    
    
      return(rbind(prologue, x[,c("dialogue.start", "dialogue.end")], data.frame(dialogue.start = dialogue.hold, dialogue.end = dialogue.hold, stringsAsFactors = FALSE)))
    })
    )
    
    snap.2 <- do.call(c, lapply(split(quotes.gaps, seq(nrow(quotes.gaps))), function(c, novel) {
      paste(novel[c$dialogue.start:c$dialogue.end], collapse = " ")
    }, novel = snap))
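
The same idea (find where each dialogue starts and ends, then paste by group) can also be sketched in base R with a running quote count. This is not the method of either answer, only an illustration: `collapse_dialogue` is a hypothetical helper, and it assumes that dialogue is delimited by plain `"` characters that always come in matched pairs across the whole vector.

```r
# Hypothetical helper (not from the answers above): merge consecutive lines
# that fall inside the same open quotation.
collapse_dialogue <- function(lines, quote = "\"") {
  # Number of quote characters on each line (base R, no stringr required)
  n_quotes <- lengths(regmatches(lines, gregexpr(quote, lines, fixed = TRUE)))
  # TRUE when a quotation is still open at the end of the line
  open_after <- cumsum(n_quotes) %% 2 == 1
  # A line starts a new group only if no quotation was open before it
  open_before <- c(FALSE, head(open_after, -1))
  group <- cumsum(!open_before)
  # Paste the lines of each group together, keeping the original order
  unname(vapply(split(lines, group), paste, character(1), collapse = " "))
}

# Small made-up example: lines 1 to 3 share one quotation
lines <- c("He said: \"Hello.", "How are you?", "Goodbye!\"", "Then he left.")
collapse_dialogue(lines)
```

Because the grouping relies only on the parity of the running quote count, a single stray or missing quote shifts every group after it, so the input would need checking first (for example with the `which(...)` test from the question).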