I'm a new R user, and I'm currently struggling with how to split a string in each row of a data frame and then create a new row with the modified string (as well as modify the original one). Here is an example below, but the actual data set is much larger.
library(dplyr)
library(stringr)
library(tidyverse)
library(utils)
posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
"sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
"sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)
I want to break up sentences that are over a certain word count (15 for this data set) by using regexes to create new sentences out of the long ones: first try to break on a period (or other sentence punctuation); if the word count is still too long, try a comma followed by an "I" (or a capital letter); then try "and" followed by a capital letter, and so on. Each time I create a new sentence, it needs to change the sentence in the old row to the first part of the split (also changing the word count, for which I have a function), while creating a new row with the same element_id, the next sentence_id in the sequence (if sentence_id was 1, the new sentence is now 2), and the new word count, and then shift the sentence_id of all following sentences in that element down by one.
I've been working on this for a few days and can't figure out how to do it. I've tried unnest_tokens, str_split/str_extract, and various combinations of dplyr filter, mutate, etc., along with Google/SO searches. Does anyone know the best way to achieve this? dplyr is preferred, but I'm open to anything that works. Feel free to ask questions if you need any clarification!
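For concreteness, one pass of the first heuristic is easy enough with something like the sketch below; it is the cascade of fallback splits, plus the renumbering of sentence_id, that I cannot work out:

# split on sentence punctuation, keeping it via a lookbehind
# (a trailing empty string may appear after a final period)
str_split(posts_sentences$sentence[2], "(?<=[.!?]) ?")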
Edit to add the expected output data frame:
expected_output <- data.frame("element_id" = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), "sentence_id" = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6),
"sentence" = c("You know, when I grew up", "I grew up in a very religious family", "I had the same sought of troubles people have", "I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.", "I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.", "I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and", "I don't know who to tell and", "I was going to tell my friend about it but I'm not sure.", "I keep saying omg!", "it's too much"),
"sentence_wc" = c(6, 8, 8, 21, 4, 27, 6, 7, 9, 7, 13, 4, 3), stringsAsFactors=FALSE)
Answer 0 (score: 3)
Edit: I have edited the whole answer to address the specific problem in more detail.
This is not fully generic, since it assumes that the groups are based entirely on element_id.
split_too_long <- function(str, max.words=15L, ...) {
  cuts <- stringi::stri_locate_all_words(str)[[1L]]
  # return one of these
  if (nrow(cuts) <= max.words) {
    c(str, NA_character_)
  }
  else {
    left <- substr(str, 1L, cuts[max.words, 2L])
    right <- substr(str, cuts[max.words + 1L, 1L], nchar(str))
    c(left, right)
  }
}
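# A quick illustration of the helper above: stringi::stri_locate_all_words()
# returns a matrix of word start/end positions, which is what drives the cut.
# E.g. stringi::stri_locate_all_words("I grew up")[[1L]] locates "I" at 1-1,
# "grew" at 3-6 and "up" at 8-9, so with max.words = 2L the cut falls after
# character 6.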
recursive_split <- function(not_done, done=NULL, ...) {
  left_right <- split_too_long(not_done, ...)
  # return one of these
  if (is.na(left_right[2L]))
    c(done, left_right[1L])
  else
    recursive_split(left_right[2L], done=c(done, left_right[1L]), ...)
}

collapse_split <- function(sentences, regex="[.;:] ?", ...) {
  sentences <- paste(sentences, collapse=". ")
  sentences <- unlist(strsplit(sentences, split=regex))
  # return
  unlist(lapply(sentences, recursive_split, done=NULL, ...))
}

group_fun <- function(grouped_df, ...) {
  # initialize new data frame with new number of rows
  new_df <- data.frame(sentence=collapse_split(grouped_df$sentence, ...),
                       stringsAsFactors=FALSE)
  # count words
  new_df$sentence_wc <- stringi::stri_count_words(new_df$sentence)
  # add sentence_id
  new_df$sentence_id <- 1L:nrow(new_df)
  # element_id must be equal because it is a grouping variable,
  # so take 1 to repeat it in output
  new_df$element_id <- grouped_df$element_id[1L]
  # return
  dplyr::filter(new_df, sentence_wc > 0L)
}

out <- posts_sentences %>%
  group_by(element_id) %>%
  do(group_fun(., max.words=5L, regex="[.;:!] ?"))
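For example, split_too_long() on its own simply hard-cuts after max.words words, and recursive_split() keeps applying it until the remainder is short enough (a quick illustration, not part of the pipeline above):

split_too_long("this sentence has more than five words in it", max.words = 5L)
# [1] "this sentence has more than" "five words in it"
recursive_split("this sentence has more than five words in it", max.words = 5L)
# [1] "this sentence has more than" "five words in it"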
Answer 1 (score: 3)
Here is a pmap approach that allows you to specify your own heuristics, which I think should work best for your situation. The key is using pmap to create a list of rows, which you can then split as needed with map_if. In my opinion this is a situation that is difficult to handle with rowwise alone, because we are adding rows in the operation, and so rowwise is hard to use.

The structure of split_too_long() is basically:

1. use dplyr::mutate and tokenizers::count_words to get the word count of each sentence
2. use purrr::pmap to turn the data frame into a list of rows (pmap accepts a data frame, as a list of columns, as input)
3. use purrr::map_if to check whether the word count is greater than our desired limit
4. use tidyr::separate_rows to split the sentence into multiple rows where that condition is met
5. recount the words and remove any empty rows (created by doubled-up separators) with filter.

We can then apply this with different separators as we realise that the elements need to be split further. Here I use these patterns, corresponding to the heuristics you mention:

- "[\\.\\?\\!] ?", which matches any of .!? plus an optional space
- ", ?(?=[:upper:])", which matches a comma plus an optional space, when followed by an uppercase letter
- "and ?(?=[:upper:])", which matches and plus an optional space, when followed by an uppercase letter.

It correctly returns the same split sentences as the expected output. sentence_id is easy to add at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.

Caveats:

- The words are recounted on every pass through split_too_long; pulling the recount out of the map step might make it faster, though I haven't profiled this on a large dataset.
- If being able to specify the column names in the call to the function is important to you, I would suggest looking at the programming with dplyr vignette (it should only take a few tweaks to implement).

posts_sentences <- data.frame(
"element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
"sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
"sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)
library(tidyverse)
library(tokenizers)
split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}
posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups: element_id [2]
#> element_id sentence_id sentence wc
#> <dbl> <int> <chr> <int>
#> 1 1 1 You know, when I grew up 6
#> 2 1 2 I grew up in a very religious family 8
#> 3 1 3 I had the same sought of troubles people ~ 9
#> 4 1 4 I was excelling in alot of ways, but beca~ 21
#> 5 1 5 Im at breaking point 4
#> 6 1 6 I have no one to talk to about this and i~ 29
#> 7 1 7 I dont know what to do 6
#> 8 2 1 I feel like I’m going to explode 7
#> 9 2 2 I have so many thoughts and feelings insi~ 8
#> 10 2 3 I don't know who to tell 6
#> 11 2 4 I was going to tell my friend about it bu~ 13
#> 12 2 5 I keep saying omg 4
#> 13 2 6 it's too much 3
Created on 2018-05-21 by the reprex package (v0.2.0).
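As a quick check of the lookahead separators (a minimal sketch, reusing the third pattern from above): they only split where the next letter is uppercase, and they leave behind the whitespace that str_trim then removes:

str_split("I told him and He said no and he left", "and ?(?=[:upper:])")
#> [[1]]
#> [1] "I told him "            "He said no and he left"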
Answer 2 (score: 1)
This solution first splits the sentences at a comma that comes before a capital letter. Then it splits at every comma and period. Finally, if a sentence is still over the word limit, it is split at every capital letter.
posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
"sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
"sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)
# To create an empty data frame to save the new elements
new_posts_sentences <- data.frame(element_id = as.numeric(),
                                  sentence_id = as.numeric(),
                                  sentence = character(),
                                  sentence_wc = as.numeric(), stringsAsFactors = FALSE)
limit_words <- 15 # 15 for this data set
countSentences <- 0
for (sentence in posts_sentences[, 3]) {
  vector <- character()
  Velement_id <- posts_sentences$element_id[countSentences + 1]
  vector <- c(vector, sentence) # To create a vector with the sentences
  vector <- vector[!vector %in% ''] # remove empty elements from vector

  ## First we will split at a comma followed by a space and a capital letter
  if (lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words) {
    vector <- vector[!vector %in% sentence]
    split_points <- unlist(gregexpr("[:,:]\\s[A-Z]", sentence)) # To get the character positions
    sentences_1 <- substring(sentence, c(1, split_points + 2), c(split_points - 1, nchar(sentence)))
    for (sentence in sentences_1) {
      vector <- c(vector, sentence)
      vector <- vector[!vector %in% '']

      ## If a sentence is still over the word limit, split it at each comma or period
      if (lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words) {
        vector <- vector[!vector %in% sentence]
        split_points <- unlist(gregexpr("[:,:]|[:.:]", sentence))
        sentences_2 <- substring(sentence, c(1, split_points + 1), c(split_points - 1, nchar(sentence)))
        for (sentence in sentences_2) {
          vector <- c(vector, sentence)
          vector <- vector[!vector %in% '']

          ## If a sentence is still over the word limit, split it at each capital letter
          if (lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words) {
            vector <- vector[!vector %in% sentence]
            split_points <- unlist(gregexpr("[A-Z]", sentence))
            sentences_3 <- substring(sentence, c(1, split_points), c(split_points - 1, nchar(sentence)))
            vector <- c(vector, sentences_3)
            vector <- vector[!vector %in% '']
          }
        }
      }
    }
  }

  ## To make a data frame of each original sentence
  element_id <- rep(Velement_id, length(vector))
  sentence_id <- 1:length(vector)
  sentence_wc <- character()
  for (element in vector) { sentence_wc <- c(sentence_wc, lengths(gregexpr("[A-z]\\W+", element))) }
  sentenceDataFrame <- data.frame(element_id, sentence_id, vector, sentence_wc)

  ## To join it with the final data frame
  new_posts_sentences <- rbind(new_posts_sentences, sentenceDataFrame)
  countSentences <- countSentences + 1
}
You get this data frame:
print(new_posts_sentences)
element_id sentence_id vector sentence_wc
1 1 1 You know, when I grew up 5
2 1 2 I grew up in a very religious family 7
3 1 3 I had the same sought of troubles people have 8
4 1 4 I was excelling in alot of ways 6
5 1 5 but because there was alot of trouble at home 8
6 1 6 we were always moving around 4
7 1 1 Im at breaking point 3
8 1 2 I have no one to talk to about this and if 11
9 1 3 I’m honest 3
10 1 4 I think 2
11 1 5 I’m too scared to tell anyone because if 9
12 1 6 I do then it becomes real 5
13 1 7 I dont know what to do 5
14 2 1 I feel like I’m going to explode. 8
15 2 1 I have so many thoughts and feelings inside and 9
16 2 2 I don't know who to tell and 8
17 2 3 I was going to tell my friend about it but 10
18 2 4 I'm not sure 3
19 2 1 I keep saying omg!it's too much 7
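A side note on the sentence_wc values above: lengths(gregexpr("[A-z]\\W+", x)) counts letter-then-separator transitions rather than words, so it typically comes out one lower than the true word count (and [A-z] also matches a few punctuation characters that sit between the two letter ranges):

lengths(gregexpr("[A-z]\\W+", "Im at breaking point"))
# [1] 3   (four words; the final word has no trailing separator)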
I hope it helps.
Answer 3 (score: 0)
An alternative tidyverse solution:
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(utils)
check_and_split <- function(element_id, sentence_id, sentence, sentence_wc,
                            word_count, attmpt) {

  methods <- c("\\.", ",\\s?(?=[I])", "and\\s?(?=[A-Z])")

  df <- data.frame(element_id = element_id,
                   sentence_id = sentence_id,
                   sentence = sentence,
                   sentence_wc = sentence_wc,
                   word_count = word_count,
                   attmpt = attmpt,
                   stringsAsFactors = FALSE)

  if (word_count <= 15 | attmpt >= 3) {
    return(df) # early return
  } else {
    df %>%
      tidyr::separate_rows(sentence, sep = methods[attmpt + 1]) %>%
      mutate(word_count = str_count(sentence, '\\w+'),
             attmpt = attmpt + 1)
  }
}
posts_sentences %>%
  mutate(word_count = str_count(sentence, '\\w+'),
         attmpt = 0) %>%
  pmap_dfr(check_and_split) %>%
  pmap_dfr(check_and_split) %>%
  pmap_dfr(check_and_split)
Here we create a helper function that takes a single row (broken up into its elements, as supplied by purrr::pmap()), reassembles it back into a data frame, and checks whether the word count exceeds 15 and how many split attempts have already been made on the sentence. Then we use tidyr::separate_rows() with the separator token corresponding to the next attempt, update word_count and the number of attempts, and return the data frame.
I am applying the same function three times; this could be wrapped into a loop (lapply/purrr::map won't work, because we need to update the data frame sequentially), as sketched below.
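For example, a plain for loop does the sequential updating (a minimal sketch with the same behaviour as the three piped calls above):

result <- posts_sentences %>%
  mutate(word_count = str_count(sentence, '\\w+'),
         attmpt = 0)
for (i in 1:3) {
  result <- pmap_dfr(result, check_and_split) # each pass sees the previous pass's rows
}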
As far as the regex tokens go: first we use a literal period; then we look for a comma, optionally followed by whitespace, followed by "I" (note the positive-lookahead syntax); finally we try "and", optionally followed by a space, with a lookahead for a capital letter.
Hope this makes sense.
Answer 4 (score: 0)
I think the easiest way is to use the str_split() function from the stringr package (to split each text chunk according to your regex) together with the unnest() function from the tidyr package.
sentences_split = posts_sentences %>%
  mutate(text_split = str_split(sentence, pattern = "\\.")) %>%
  unnest(text_split) %>%
  # Count number of words in text_split
  mutate(wc_split = str_count(text_split, "\\w+")) %>%
  filter(wc_split != 0) %>%
  # Split again if text_split column has >15 words
  mutate(text_split_again = ifelse(wc_split > 15, str_split(text_split, pattern = ",\\sI"), text_split)) %>%
  unnest(text_split_again)
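The same pattern can then be repeated for the remaining heuristics, recounting the words after each unnest(). A sketch of one more pass (the wc_split_again and text_split_3 names are my own, hypothetical):

sentences_split %>%
  mutate(wc_split_again = str_count(text_split_again, "\\w+")) %>%
  # split once more on "and" followed by a capital letter if still too long
  mutate(text_split_3 = ifelse(wc_split_again > 15,
                               str_split(text_split_again, pattern = "and\\s(?=[A-Z])"),
                               text_split_again)) %>%
  unnest(text_split_3)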