How do I extract the first paragraph from text in a data frame?

Time: 2017-10-23 20:13:34

Tags: r dplyr stringr

Consider this data frame:

library(dplyr)
library(stringr)


mydf <- data_frame(text = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum',
                            'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum',
                            'this is a short text without paragraphs! HA!!!'))

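I would like to create a column first_paragraphs that contains only the first two paragraphs of the text stored in the text column. As you can see, sometimes there is not even a single paragraph break (row 3). In that case, keeping the text as-is is fine.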

I tried the following, without success.

#this function finds the position of the second \n in the data
myend <- function(text){
 myend <- str_locate_all(text, "\n")[[2]] %>% as_tibble() %>% pull(end) 
 myend
}

mydf <-mydf %>% mutate(thresh = myend(text),
                       #here I only keep text until that threshold
                       first_paragraphs= str_sub(text, 1, thresh))

Error in mutate_impl(.data, dots) : 
  Evaluation error: subscript out of bounds.

What is wrong here?

The expected output is:

data_frame(text = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
                    'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
                    'this is a short text without paragraphs! HA!!!'))

Many thanks!

2 answers:

Answer 0: (score: 2)

Here is a base R solution with strsplit:

mydf$firstparagraph = paste(strsplit(mydf$text, "\n")[[1]][1:2], collapse = "\n")

Result:

> mydf$firstparagraph
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "

Edit

With the OP's updated dataset, here is a way to extract the first two paragraphs from each row of text:

mydf$firstparagraph = sapply(strsplit(mydf$text, "\n"), 
                             function(x) sub("\nNA$", "", paste(x[1:2], collapse = "\n")))

For better readability, you can use the pipe from dplyr:

library(dplyr)

mydf$text %>%
  strsplit("\n") %>%
  sapply(function(x){
    x[1:2] %>%
      paste(collapse = "\n") %>%
      sub("\nNA$", "", .)
  })

With the tidyverse:

library(dplyr)
library(stringr)
library(purrr)

mydf %>%
  # map_chr returns a character vector rather than a list column
  mutate(firstparagraph = map_chr(strsplit(text, "\n"), ~{
    .[1:2] %>% 
      paste(collapse = "\n") %>% 
      str_replace("\nNA$", "")
  }))

Result:

> mydf$firstparagraph
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
[2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
[3] "this is a short text without paragraphs! HA!!!" 

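sapply is needed because the text column now has multiple rows, so strsplit returns a list in which each element corresponds to one row of text. sub removes the extra \nNA that appears for rows with fewer than two paragraphs.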
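
To make the explanation above concrete, here is a minimal base R sketch. The two input strings are illustrative placeholders, not the OP's data, and `head(p, 2)` is swapped in for `x[1:2]` as one possible way to avoid the trailing NA instead of removing it afterwards with sub.

```r
# Two illustrative strings: one with three paragraphs, one with no line break
x <- c("first\nsecond\nthird", "only one")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(x, "\n")

# p[1:2] would pad short vectors with NA; head(p, 2) keeps at most two
# elements, so no "\nNA" suffix is ever produced
first_two <- vapply(
  parts,
  function(p) paste(head(p, 2), collapse = "\n"),
  character(1)
)

first_two
```

vapply is used here instead of sapply only so that the result is guaranteed to be a character vector, which is convenient when assigning it to a data frame column.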

Answer 1: (score: 1)

This gives you the first two paragraphs in the variable "first_paragraphs", along with the "thresh" variable:

# paste0 needs its pieces separated by commas
mydf <- data_frame(text = paste0(
  'Lorem ipsum dolor sit amet, ',
  'consectetur adipiscing elit, ',
  'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ',
  '\nUt enim ad minim veniam, ',
  'quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
  '\nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. ',
  'Excepteur sint occaecat cupidatat non proident, ',
  'sunt in culpa qui officia deserunt mollit anim id est laborum'))

mydf <- mydf %>% mutate(thresh = str_locate_all(mydf$text, "\n")[[1]][2, 2],
                        first_paragraphs = str_sub(text, 1, thresh))
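
The snippet above indexes `[[1]][2, 2]`, so it handles a single row that contains at least two line breaks. A possible generalization to several rows, including rows with fewer than two breaks, is sketched below. The helper name `second_break` and the sample strings are hypothetical, and the cut-off is placed just before the second "\n" so that the break itself is not kept, which appears to match the expected output in the question.

```r
library(stringr)

# Hypothetical helper: for each string, return the position just before the
# second "\n", or the full string length when there are fewer than two breaks
second_break <- function(text) {
  locs <- str_locate_all(text, "\n")  # list of matrices, one per string
  vapply(seq_along(text), function(i) {
    m <- locs[[i]]
    if (nrow(m) >= 2) {
      as.integer(m[2, "start"] - 1L)
    } else {
      nchar(text[i])
    }
  }, integer(1))
}

# Illustrative input, not the OP's data
x <- c("a\nb\nc", "no breaks")
first_paragraphs <- str_sub(x, 1, second_break(x))
first_paragraphs
```

Inside a mutate call this could be used as `mutate(thresh = second_break(text), first_paragraphs = str_sub(text, 1, thresh))`, since both the helper and str_sub are vectorized over rows.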