Splitting strings into smaller strings to create new rows in a data frame (in R)

Asked: 2018-05-19 17:36:55

Tags: r string dplyr

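I'm a new R user and I'm struggling with how to split the string in each row of a data frame and then create a new row from the split-off part (as well as modify the original string). An example is below, but the actual data set is much larger.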

library(dplyr)
library(stringr)
library(tidyverse)
library(utils)

posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3), 
                "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"), 
                "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)

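I want to break up sentences that exceed a certain word count (15 for this data set), using regular expressions to create new sentences from the longer ones. First I try splitting on periods (or other symbols); if the word count is still too high, I try a comma followed by an I (or a capital letter); then I try 'and' followed by a capital letter, and so on. Each time I create a new sentence, the sentence in the old row has to be changed to just its first part and its word count updated (I have a function for that). A new row is also created with the same element_id, the next sentence_id in sequence (if sentence_id was 1, the new sentence is 2), and the new sentence's word count. All the sentences below it then have their sentence_id shifted to the next number.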

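I've been working on this for a few days and can't figure out how to do it. I've tried unnest_tokens, str_split / extract, and various dplyr combinations of filter, mutate, etc., along with Google / SO searches. Does anyone know the best way to accomplish this? dplyr is preferred, but I'm open to anything that works. Feel free to ask questions if you need any clarification!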

Edited to add the expected output data frame:

expected_output <- data.frame("element_id" = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), "sentence_id" = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6), 
                                   "sentence" = c("You know, when I grew up", "I grew up in a very religious family", "I had the same sought of troubles people have", "I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.", "I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.", "I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and", "I don't know who to tell and", "I was going to tell my friend about it but I'm not sure.", "I keep saying omg!", "it's too much"), 
                                   "sentence_wc" = c(6, 8, 8, 21, 4, 27, 6, 7, 9, 7, 13, 4, 3), stringsAsFactors=FALSE)

5 answers:

Answer 0 (score: 3)

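Edit: I have rewritten the whole answer to address the specific problem in more detail.

This is not fully generic, because it assumes the groups are based solely on element_id.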

split_too_long <- function(str, max.words=15L, ...) {
  cuts <- stringi::stri_locate_all_words(str)[[1L]]

  # return one of these
  if (nrow(cuts) <= max.words) {
    c(str, NA_character_)
  }
  else {
    left <- substr(str, 1L, cuts[max.words, 2L])
    right <- substr(str, cuts[max.words + 1L, 1L], nchar(str))
    c(left, right)
  }
}

recursive_split <- function(not_done, done=NULL, ...) {
  left_right <- split_too_long(not_done, ...)

  # return one of these
  if (is.na(left_right[2L]))
    c(done, left_right[1L])
  else
    recursive_split(left_right[2L], done=c(done, left_right[1L]), ...)
}

collapse_split <- function(sentences, regex="[.;:] ?", ...) {
  sentences <- paste(sentences, collapse=". ")
  sentences <- unlist(strsplit(sentences, split=regex))
  # return
  unlist(lapply(sentences, recursive_split, done=NULL, ...))
}

group_fun <- function(grouped_df, ...) {
  # initialize new data frame with new number of rows
  new_df <- data.frame(sentence=collapse_split(grouped_df$sentence, ...),
                       stringsAsFactors=FALSE)
  # count words
  new_df$sentence_wc <- stringi::stri_count_words(new_df$sentence)
  # add sentence_id
  new_df$sentence_id <- 1L:nrow(new_df)
  # element_id must be equal because it is a grouping variable,
  # so take 1 to repeat it in output
  new_df$element_id <- grouped_df$element_id[1L]
  # return
  dplyr::filter(new_df, sentence_wc > 0L)
}

out <- posts_sentences %>%
  group_by(element_id) %>%
  do(group_fun(., max.words=5L, regex="[.;:!] ?"))
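The core of the code above is cutting a string after every max.words words. As a rough illustration of that idea only, here is a minimal base R sketch. The helper name chunk_words is hypothetical, and it assumes words are separated by whitespace, which is cruder than the word boundaries stringi detects.

```r
# Hypothetical helper (not part of the answer above): cut a sentence into
# pieces of at most max_words whitespace-separated words.
chunk_words <- function(sentence, max_words = 15L) {
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  # Words 1..max_words go to group 1, the next max_words to group 2, etc.
  groups <- ceiling(seq_along(words) / max_words)
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

chunk_words("I keep saying that it is too much", max_words = 3L)
# returns c("I keep saying", "that it is", "too much")
```

Unlike the answer's collapse_split(), this sketch does not first split on punctuation, so it cuts purely by position.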

Answer 1 (score: 3)

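Here is a tidyverse approach that lets you specify your own heuristics, which I think should suit your situation best. The key is to use pmap to create a list with one element per row, which you can then split where necessary with map_if. In my opinion this is hard to do with dplyr alone: we are adding rows during the operation, so rowwise is hard to use.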

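The structure of split_too_long() is basically:

  1. Use dplyr::mutate and tokenizers::count_words to get the word count of each sentence.
  2. Make each row an element of a list with purrr::pmap, which accepts the data frame as a list of columns.
  3. Use purrr::map_if to check whether the word count is greater than our desired limit.
  4. If it is, use tidyr::separate_rows to split the sentence into multiple rows.
  5. Replace the word count with the new word count and use filter to drop any empty rows (created by doubled-up separators).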
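  6. We can then apply this with different separators as we find that elements need to be split further. Here I use these patterns, corresponding to the heuristics you mention:

    • "[\\.\\?\\!] ?" matches any of .!? and an optional space
    • ", ?(?=[:upper:])" matches ,, an optional space, preceding an uppercase letter
    • "and ?(?=[:upper:])" matches and, an optional space, preceding an uppercase letter.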

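    It correctly returns the same split sentences as in your expected output. sentence_id is easy to add back at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.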
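To see what each of the three patterns does on its own, here is a small base R sketch. It is an illustration only: the helper split_on is hypothetical, it uses strsplit() with perl = TRUE rather than the answer's separate_rows(), and the character class is written as [[:upper:]] because base R's PCRE engine expects the outer brackets.

```r
# The three separators from the list above, adapted for base R (PCRE).
patterns <- c(
  sentence_end = "[.?!] ?",
  comma_upper  = ", ?(?=[[:upper:]])",
  and_upper    = "and ?(?=[[:upper:]])"
)

# Hypothetical helper: split one string on a pattern and trim whitespace.
split_on <- function(x, pattern) {
  trimws(strsplit(x, pattern, perl = TRUE)[[1]])
}

by_end   <- split_on("Im at breaking point.I dont know what to do.",
                     patterns[["sentence_end"]])
by_comma <- split_on("You know, when I grew up, I grew up in a very religious family",
                     patterns[["comma_upper"]])
by_and   <- split_on("I don't know who to tell and I was going to tell my friend",
                     patterns[["and_upper"]])

print(by_end)    # two pieces, terminal punctuation removed
print(by_comma)  # the comma before lowercase "when" is not a split point
print(by_and)    # the separator "and " itself is dropped
```

Note that the separators are consumed by the split, so the pieces lose their trailing punctuation and the word "and".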

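    Caveats:

    • I wrote this for readability in exploratory analysis, hence the splitting into lists and binding back together each time. If you decide in advance which separators you want, you can put them into a single map step, which might make it faster, though I haven't profiled this on a large data set.
    • As noted in the comments, there are still sentences with more than 15 words after these splits. You will have to decide which additional symbols/regular expressions to split on to bring the lengths down further.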
    • The column names are currently hardcoded into split_too_long. I recommend looking at the programming with dplyr vignette if being able to specify column names in the call to the function is important to you (it would only take a few tweaks to implement).

posts_sentences <- data.frame(
  "element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
  "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
  "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)

library(tidyverse)
library(tokenizers)

split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}

posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups:   element_id [2]
#>    element_id sentence_id sentence                                       wc
#>         <dbl>       <int> <chr>                                       <int>
#>  1          1           1 You know, when I grew up                        6
#>  2          1           2 I grew up in a very religious family            8
#>  3          1           3 I had the same sought of troubles people ~      9
#>  4          1           4 I was excelling in alot of ways, but beca~     21
#>  5          1           5 Im at breaking point                            4
#>  6          1           6 I have no one to talk to about this and i~     29
#>  7          1           7 I dont know what to do                          6
#>  8          2           1 I feel like I’m going to explode                7
#>  9          2           2 I have so many thoughts and feelings insi~      8
#> 10          2           3 I don't know who to tell                        6
#> 11          2           4 I was going to tell my friend about it bu~     13
#> 12          2           5 I keep saying omg                               4
#> 13          2           6 it's too much                                   3

Created on 2018-05-21 by the reprex package (v0.2.0).

Answer 2 (score: 1)

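This solution first splits sentences at a comma or period that comes before a capital letter. It then splits the pieces at every comma and period. Finally, if a piece is still over the word limit, it is split at each capital letter.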

posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3), 
                              "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"), 
                              "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)

# To create an empty data frame to save the new elements

new_posts_sentences <- data.frame(element_id = as.numeric(),
                 sentence_id =as.numeric(), 
                 sentence = character(), 
                 sentence_wc = as.numeric(),  stringsAsFactors=FALSE) 

limit_words <- 15 # 15 for this data set

countSentences <- 0

for (sentence in posts_sentences[,3]) {

        vector <- character()

        Velement_id <- posts_sentences$element_id[countSentences + 1]

        vector <- c(vector, sentence) #To create a vector with the sentences
        vector <- vector[!vector %in% ''] #remove empty elements from vector

        ## First, split where a comma (or colon) is followed by whitespace and a capital letter
        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words ){

                vector <- vector[!vector %in% sentence]

                split_points <- unlist(gregexpr("[:,:]\\s[A-Z]", sentence)) # To get the character position

                ## Cut the sentence at those positions; pieces still over the limit are split again below at each comma or period
                sentences_1 <- substring(sentence, c(1, split_points + 2), c(split_points -1, nchar(sentence)))

                for(sentence in sentences_1){

                        vector <- c(vector, sentence)
                        vector <- vector[!vector %in% '']

                        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words){

                                vector <- vector[!vector %in% sentence]

                                split_points <- unlist(gregexpr("[:,:]|[:.:]", sentence))

                                sentences_2 <- substring(sentence, c(1, split_points + 1), c(split_points -1, nchar(sentence)))

                                ## If a piece is still over the word limit, split it at each capital letter

                                for(sentence in sentences_2){

                                        vector <- c(vector, sentence)
                                        vector <- vector[!vector %in% '']

                                        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words){

                                                vector <- vector[!vector %in% sentence]

                                                split_points <- unlist(gregexpr("[A-Z]", sentence))

                                                sentences_3 <- substring(sentence,c(1, split_points), c(split_points -1, nchar(sentence)))

                                                vector <- c(vector, sentences_3)
                                                vector <- vector[!vector %in% '']

                                        }

                                }

                        }

                }

        }

        ## Build a data frame for each original sentence
        element_id <- rep(Velement_id, length(vector))
        sentence_id <- 1:length(vector)
        sentence_wc <- character()
        for (element in vector){sentence_wc <- c(sentence_wc, (lengths(gregexpr("[A-z]\\W+", element)))) }
        sentenceDataFrame <- data.frame(element_id, sentence_id, vector, sentence_wc)       

        ## To join it with the final dataframe
        new_posts_sentences <- rbind(new_posts_sentences, sentenceDataFrame)

        countSentences <- countSentences + 1

}

You get this data frame:

print(new_posts_sentences)

   element_id sentence_id                                           vector sentence_wc
1           1           1                         You know, when I grew up           5
2           1           2             I grew up in a very religious family           7
3           1           3    I had the same sought of troubles people have           8
4           1           4                  I was excelling in alot of ways           6
5           1           5    but because there was alot of trouble at home           8
6           1           6                     we were always moving around           4
7           1           1                             Im at breaking point           3
8           1           2      I have no one to talk to about this and if           11
9           1           3                                      I’m honest            3
10          1           4                                         I think            2
11          1           5        I’m too scared to tell anyone because if            9
12          1           6                        I do then it becomes real           5
13          1           7                           I dont know what to do           5
14          2           1                I feel like I’m going to explode.           8
15          2           1 I have so many thoughts and feelings inside and            9
16          2           2                    I don't know who to tell and            8
17          2           3      I was going to tell my friend about it but           10
18          2           4                                     I'm not sure           3
19          2           1                  I keep saying omg!it's too much           7

I hope it helps.

Answer 3 (score: 0)

An alternative tidyverse solution:

library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(utils)

check_and_split <- function(element_id, sentence_id, sentence, sentence_wc,
                             word_count, attmpt){

  methods <- c("\\.", ",\\s?(?=[I])", "and\\s?(?=[A-Z])")
  df <- data.frame(element_id=element_id,
             sentence_id=sentence_id,
             sentence=sentence,
             sentence_wc=sentence_wc,
             word_count=word_count,
             attmpt=attmpt,
             stringsAsFactors = FALSE)

    if(word_count<=15 | attmpt>=3){
      return(df) #early return
    } else{
     df %>% 
        tidyr::separate_rows(sentence, sep=methods[attmpt+1]) %>% 
        mutate(word_count=str_count(sentence,'\\w+'),
               attmpt = attmpt+1)
    }
}

posts_sentences %>% 
  mutate(word_count=str_count(sentence,'\\w+'),
         attmpt=0) %>%
  pmap_dfr(check_and_split) %>% 
  pmap_dfr(check_and_split) %>% 
  pmap_dfr(check_and_split) 

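Here we create a helper function that takes a single row (broken into its elements, as supplied by purrr::pmap()), reassembles it into a data frame, and checks whether the word count exceeds 15 and how many attempts have already been made on the sentence. Then we use tidyr::separate_rows() with the separator token for the next attempt, update word_count and the number of attempts, and return the data frame.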

I apply the same function three times. This could be wrapped in a loop (lapply / purrr::map won't work, because the data frame has to be updated sequentially).

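As for the regex tokens: first we use a literal ., then we look for a comma and an optional whitespace character followed by "I". Note the positive lookahead syntax. Finally, we try "and", possibly a space, followed (via lookahead) by a capital letter.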
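The three repeated calls can be folded into a loop over the separators. The sketch below shows that idea in base R on a plain character vector rather than on the data frame used above; the names word_count and split_long are hypothetical, and words are counted with the same \\w+ pattern as in the answer.

```r
# Hypothetical helper: count runs of word characters, like str_count(x, "\\w+").
word_count <- function(x) lengths(regmatches(x, gregexpr("\\w+", x)))

# Try each separator in turn, but only on pieces still over the limit.
split_long <- function(sentences, patterns, max_words = 15) {
  for (p in patterns) {
    sentences <- unlist(lapply(sentences, function(s) {
      if (word_count(s) <= max_words) return(s)  # short enough: keep as is
      trimws(strsplit(s, p, perl = TRUE)[[1]])
    }))
    sentences <- sentences[nzchar(sentences)]    # drop empty pieces
  }
  sentences
}

patterns <- c("\\.", ",\\s?(?=I)", "and\\s?(?=[A-Z])")
pieces <- split_long(
  "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.",
  patterns
)
print(pieces)
```

Each pass works on the output of the previous one, which is why a sequential loop is used here instead of mapping over the patterns independently.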

Hope this makes sense.

Answer 4 (score: 0)

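I think the simplest way is to use the str_split() function from the stringr package (to split each block of text according to your regex) and then the unnest() function from the tidyr package.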

sentences_split = posts_sentences %>%
  mutate(text_split=str_split(sentence, pattern = "\\.")) %>%
  unnest(text_split) %>%

  #Count number of words in text_split
  mutate(wc_split = str_count(text_split, "\\w+")) %>%

  filter(wc_split!=0) %>%

  #Split again if text_split column has >15 words
  mutate(text_split_again = ifelse(wc_split>15, str_split(text_split, pattern = ",\\sI"), text_split)) %>%
  unnest(text_split_again)
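The same split-then-unnest idea can be sketched in base R, which also shows one way to renumber sentence_id within each element_id afterwards. This is a minimal illustration on a tiny made-up data frame, not a replacement for the pipeline above.

```r
# Tiny illustrative input (made up for this sketch).
df <- data.frame(
  element_id = c(1, 2),
  sentence = c("A b.C d.", "E f"),
  stringsAsFactors = FALSE
)

# Split every sentence on a literal period: one character vector per row.
pieces <- strsplit(df$sentence, "\\.")

# "Unnest" by repeating each element_id once per piece.
out <- data.frame(
  element_id = rep(df$element_id, lengths(pieces)),
  sentence = unlist(pieces),
  stringsAsFactors = FALSE
)

# Renumber the sentences within each element_id.
out$sentence_id <- ave(seq_along(out$element_id), out$element_id, FUN = seq_along)
print(out)
```

With dplyr, the last step corresponds to group_by(element_id) followed by mutate(sentence_id = row_number()).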