Question

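I have a data frame with 100 rows, one column of which contains text. I want to split that text column into sentences so that it becomes a column of sentence lists. I am using the stringi function stri_split_lines to do the splitting.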

Example:

rowID       text
1         There is something wrong. It is bad. We made it better
2          The sky is blue. The sea is green.

Desired output:

rowID       text
1           [1] There is something wrong.
            [2] It is bad.
            [3] We made it better
2           [1] The sky is blue.
            [2] The sea is green.

I tried:

dataframe<-do.call(rbind.data.frame, stri_split_lines(dataframe$text, omit_empty = TRUE))

Answer 1

Here is a tidyverse solution (it no longer uses stringi).

Assume your data frame is called df.

Solution

  library(dplyr)

  # strsplit returns a list, so text becomes a list column of sentences
  df %>%
    mutate(text = strsplit(text, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

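Explanation: strsplit inside the mutate call returns a list, so your data frame now has a true list column. The regex splits at any whitespace that follows a punctuation character and precedes a capital letter. (The regex used for the split was found elsewhere.)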
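To see what this regular expression does on its own, here is a minimal base R sketch that applies it to the two example strings from the question. The names `sentence_break`, `texts`, and `sentences` are made up for illustration; only the pattern and the `strsplit(..., perl = TRUE)` call come from the answer.

```r
# Illustrative names; the pattern is the one used in the answer.
# It matches whitespace preceded by punctuation and followed by a capital letter.
sentence_break <- "(?<=[[:punct:]])\\s(?=[A-Z])"

texts <- c(
  "There is something wrong. It is bad. We made it better",
  "The sky is blue. The sea is green."
)

# strsplit returns a list with one character vector per input string
sentences <- strsplit(texts, sentence_break, perl = TRUE)

lengths(sentences)  # number of sentences found in each string
sentences[[1]]      # sentences of the first string, punctuation kept
```

Because the pattern only looks for punctuation followed by a capital letter, it also splits after abbreviations such as "Dr. Smith". It is best treated as a heuristic rather than a full sentence tokenizer.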

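What if I want to split the list column into multiple rows?

To put each member of that list on its own row, you have two options:

Simply call tidyr::unnest on the list column: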
```
df %>% tidyr::unnest(text)
```
Or use tidyr::separate_rows on the original data frame (before creating the list column):
```
df %>% tidyr::separate_rows(text, sep= "(?<=[[:punct:]])\\s(?=[A-Z])")
```

Answer 2

Example:

dataframe[["text"]] <- strsplit(dataframe[["text"]], split = "\\.")
str(dataframe)

'data.frame':   2 obs. of  2 variables:
 $ rowID: int  1 2
 $ text :List of 2
  ..$ : chr  "There is something wrong" " It is bad" " We made it better"
  ..$ : chr  "The sky is blue" " The sea is green"

Data

dataframe <- data.frame(
  rowID = 1:2, 
  text = 
    c(
      "There is something wrong. It is bad. We made it better",
      "The sky is blue. The sea is green."
    ),
  stringsAsFactors = FALSE
)
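The `str` output in this answer shows that splitting on `"\\."` drops the periods and leaves a leading space on every sentence after the first. If that whitespace is unwanted, one possible variation (a sketch, not part of the original answer) is to trim each piece with base R's `trimws` before storing the list column. The intermediate name `pieces` is illustrative.

```r
# Same example data as in the answer
dataframe <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# Split on literal periods, then trim the whitespace left on each piece
pieces <- strsplit(dataframe[["text"]], split = "\\.")
dataframe[["text"]] <- lapply(pieces, trimws)

str(dataframe)
```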

Answer 3

Suppose DF is your data.frame:

DF <- read.table(text=
'rowID       text
1         "There is something wrong. It is bad. We made it better"
2          "The sky is blue. The sea is green."', header=TRUE, stringsAsFactors=FALSE)

Then you can get the desired output using base R functions:

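# Split each text on periods; cbind turns each result into a one-column matrix
listText <- lapply(strsplit(DF$text, "\\."), cbind)
# Repeat each row index once per sentence
id <- rep(seq_along(listText), lengths(listText))
data.frame(rowID = id, text = do.call(rbind, listText))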

Output:

  rowID                     text
1     1 There is something wrong
2     1                It is bad
3     1        We made it better
4     2          The sky is blue
5     2         The sea is green
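A slightly shorter variant of the same idea, offered as a sketch and not as the answer's own method, flattens the split result with `unlist` and repeats the existing `rowID` values instead of positional indices. This assumes `rowID` should be carried over as-is, which also works when the IDs are not 1..n. It additionally trims the leading spaces left by the split. The names `parts` and `long` are illustrative.

```r
# Same example data as in the answer, built directly for self-containment
DF <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# One character vector of sentence fragments per row
parts <- strsplit(DF$text, "\\.")

# One output row per fragment, keeping the original rowID values
long <- data.frame(
  rowID = rep(DF$rowID, lengths(parts)),
  text = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)
long
```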

Adding a list column to a data frame
