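I have a data frame with 100 rows.
One of its columns contains text.
I want to split the text column into sentences, so that each entry of the column becomes a list of sentences.
I am using the stringi package function stri_split_lines.
Example: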
rowID text
1 There is something wrong. It is bad. We made it better
2 The sky is blue. The sea is green.
Desired output:
rowID text
1 [1] There is something wrong.
  [2] It is bad.
  [3] We made it better
2 [1] The sky is blue.
  [2] The sea is green.
I tried:
dataframe<-do.call(rbind.data.frame, stri_split_lines(dataframe$text, omit_empty = TRUE))
Answer 0 (score: 2)
Here is a tidyverse solution (it no longer uses stringi):
Assume your data frame is called df.
Solution
library(dplyr)
df %>%
  mutate(text = strsplit(text, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
Explanation: strsplit inside the mutate call returns a list, so the data frame now has a true list column. (The string-splitting regex was found here.)
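To make the list column concrete, here is a minimal sketch that rebuilds the two sample rows from the question and inspects the result. The names `df`, `sentence_break`, and `out` are assumptions made for this illustration.

```r
library(dplyr)

# Sample data rebuilt from the question; `df` is an assumed name.
df <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# Split at a whitespace character that follows punctuation
# and precedes a capital letter.
sentence_break <- "(?<=[[:punct:]])\\s(?=[A-Z])"

out <- df %>%
  mutate(text = strsplit(text, sentence_break, perl = TRUE))

is.list(out$text)  # checks that the column is a list
lengths(out$text)  # number of sentences found in each row
out$text[[1]]      # the sentences of the first row
```

Note that this pattern also splits after abbreviations followed by a capitalized word, such as "Dr. Smith", so it may need adjusting for real text.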
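What if I want to split the list column into multiple rows?
To put each member of the list on its own row, you have two options:
Simply call tidyr::unnest on the list column: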
df %>% tidyr::unnest(text)
Use tidyr::separate_rows on the original data frame (before creating the list column):
df %>% tidyr::separate_rows(text, sep= "(?<=[[:punct:]])\\s(?=[A-Z])")
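As a rough end-to-end check of the first option, the sketch below builds the list column with base strsplit on the question's sample rows and then unnests it. The names `df` and `long` are assumptions made for this illustration.

```r
library(tidyr)

# Sample data rebuilt from the question; `df` is an assumed name.
df <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# Build the list column first, using the same regex as the solution.
df[["text"]] <- strsplit(df[["text"]], "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)

# One row per sentence; rowID is repeated for every sentence of a row.
long <- unnest(df, text)

nrow(long)
long$rowID
```

With the two sample rows this should give one row per sentence, with each rowID repeated once for every sentence that came from that row.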
Answer 1 (score: 0)
Example:
dataframe[["text"]] <- strsplit(dataframe[["text"]], split = "\\.")
str(dataframe)
'data.frame': 2 obs. of 2 variables:
$ rowID: int 1 2
$ text :List of 2
..$ : chr "There is something wrong" " It is bad" " We made it better"
..$ : chr "The sky is blue" " The sea is green"
Data
dataframe <- data.frame(
rowID = 1:2,
text =
c(
"There is something wrong. It is bad. We made it better",
"The sky is blue. The sea is green."
),
stringsAsFactors = FALSE
)
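Splitting on "\\." removes the periods and leaves a leading space on every sentence after the first, as the str() output shows. If that matters, two possible variations are sketched below. They are not part of the original answer, and the names `text`, `keep_period`, and `trimmed` are assumptions made for this illustration.

```r
text <- c(
  "There is something wrong. It is bad. We made it better",
  "The sky is blue. The sea is green."
)

# Variation 1: split on whitespace that follows a period,
# so the period stays attached to its sentence.
keep_period <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)

# Variation 2: split on the period as in the answer,
# then strip the surrounding whitespace from each piece.
trimmed <- lapply(strsplit(text, "\\."), trimws)

keep_period[[1]]
trimmed[[2]]
```

Either result can be assigned back to dataframe[["text"]] in the same way as in the example.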
Answer 2 (score: -1)
Consider DF to be your data.frame:
DF <- read.table(text=
'rowID text
1 "There is something wrong. It is bad. We made it better"
2 "The sky is blue. The sea is green."', header=TRUE, stringsAsFactors=FALSE)
Then you can get the desired output using base R functions:
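# Split each text on periods; cbind turns each result into a one-column matrix
listText <- lapply(strsplit(DF$text, "\\."), cbind)
# Repeat each row index once per sentence found in that row
id <- rep(seq_along(listText), lengths(listText))
# Stack the matrices and pair each sentence with its row index
data.frame(rowID = id, text = do.call(rbind, listText))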
Output:
rowID text
1 1 There is something wrong
2 1 It is bad
3 1 We made it better
4 2 The sky is blue
5 2 The sea is green