Split the text in each data frame row into five even chunks

Time: 2017-09-30 12:04:15

Tags: r dataframe

I am hoping for some help with this tricky string problem.

Current data frame

ID  Text
1   This is a very long piece of string. This contains many lines.

I would like to convert it to:

ID   Text1            Text2            Text3           Text4         Text5
1    This is a        very long piece  of string.      This contains  many lines. 

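The string should be split evenly at word boundaries. In the example above I am trying to split the line evenly into 5 parts, so each column should contain 20% of the words.

The goal is to bin the words so that, once a conversation has been split up, the chunks can be looked at as time-series data.

3 answers:

Answer 0: (score: 4)

There may be better options, but this one needs no additional packages:

First, we create a reproducible example: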

df <- data.frame(ID=1:2,
                 Text=c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Lorem ipsum dolor sit amet, consectetur adipiscing elit"),
                 stringsAsFactors = FALSE)

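Then comes chunkize, a wrapper around split + cut, which is the tricky part. It takes a character string, splits it on spaces, cuts the words into n chunks, and returns a data.frame with n columns. (We remove the names so that the rbind further down works properly.)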

chunkize <- function(chr, n=5){
  x <- strsplit(chr, " ")[[1]]
  df <- as.data.frame(
    lapply(
      split(x, 
            cut(seq_along(x), 
                breaks=n)), 
      paste, collapse=" "), 
    stringsAsFactors = FALSE, col.names=NULL)
  names(df) <- NULL
  df
}
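To see the grouping step of chunkize in isolation, here is a minimal sketch using the sentence from the question. The variable names (words, grp, chunks) are only illustrative, and labels = FALSE is used here just to get integer bin codes instead of interval labels:

```r
# Split the example sentence into words, as chunkize does.
words <- strsplit("This is a very long piece of string. This contains many lines.", " ")[[1]]

# cut() assigns each word position to one of 5 equal-width bins.
grp <- cut(seq_along(words), breaks = 5, labels = FALSE)
grp
# 1 1 1 2 2 3 3 4 4 5 5 5

# Paste the words of each bin back together.
chunks <- as.vector(tapply(words, grp, paste, collapse = " "))
chunks
```

Because the bins have equal width over the positions 1 to 12, the first and last bins pick up three words and the middle bins two.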

Then we simply apply it to each row. We also add the ID column:

df_chunked <- do.call("rbind", 
                      apply(df, 1, 
                         function(x) cbind(x[1], chunkize(x[-1], 5))))

Finally, we rename the columns:

colnames(df_chunked) <- c("ID", paste0("Text", 1:5))

The same thing wrapped in a convenience function:

chunkize_this <- function(df, n=5){
  chunkize <- function(chr, n){
    x <- strsplit(chr, " ")[[1]]
    df <- as.data.frame(
      lapply(
        split(x, 
              cut(seq_along(x), 
                  breaks=n)), 
        paste, collapse=" "), 
      stringsAsFactors = FALSE, col.names=NULL)
    names(df) <- NULL
    df
  }

  df_chunked <- do.call("rbind", 
                        apply(df, 1, function(x) cbind(x[1], chunkize(x[-1], n))))
  colnames(df_chunked) <- c(colnames(df)[1], paste0("Text", 1:n))
  rownames(df_chunked) <- NULL
  df_chunked
}

You can try it with:

View(chunkize_this(df, 3))
View(chunkize_this(df, 5))

Another example:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
  'ID   Text
  1    "This is a very long piece of string. This contains many lines."
  2    "This is a very long piece of string. It contains one or two more word."
  3    "Short"'
)

> chunkize_this(df, 5)
ID     Text1           Text2         Text3           Text4                Text5
1  1 This is a       very long      piece of    string. This contains many lines.
2  2 This is a very long piece of string. It contains one or       two more word.
3  3                                   Short                                     
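In the output above, the single word of row 3 ends up in Text3 rather than Text1. A minimal sketch of why, assuming the behaviour of cut() when every value is identical (it then spreads the breaks symmetrically around that value, so the value falls into the middle bin):

```r
# One word means cut() sees a single position, so the range has zero width.
pos <- seq_along("Short")                      # just 1
bin <- cut(pos, breaks = 5, labels = FALSE)
bin
# 3

# split() keeps the empty factor levels, so the other chunks become "".
chunks <- sapply(split("Short", cut(pos, breaks = 5)), paste, collapse = " ")
unname(chunks)
# ""  ""  "Short"  ""  ""
```

If left-aligned output is preferred for very short rows, the group numbers can be relabelled consecutively before pasting, as the other answers do.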

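Answer 1: (score: 3)

Alternative approaches implemented in data.table, base R and the tidyverse. The number of parts can be hard-coded or defined up front:

# defining the number of parts up front
np <- 5

The different options:

1) Using 'data.table':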

library(data.table)

# method 1
setDT(DF)[, strsplit(Text, "\\s"), by = ID
          ][, grp := rleid(cut(1:.N, np)), by = ID
            ][, paste(V1, collapse = " "), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = "V1")]

# method 2
setDT(DF)[, strsplit(Text, ' '), by = ID
          ][, grp := {s <- ceiling(.N/np); rleid(s:(.N+s-1) %/% (.N/np))}, by = ID
            ][, paste(V1, collapse = ' '), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = 'V1')]

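Both give a result of this form. The output shown matches method 2; method 1 relies on cut(), so for ID 1 it starts with "This is a" / "very long", as in the other cut()-based outputs on this page: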

   ID     Text1           Text2         Text3           Text4                Text5
1:  1   This is     a very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text            NA              NA                   NA

2) Base R:

# method 1
equal_parts <- function(x, np = 5) {
  n <- cut(seq_along(x), np)
  n <- as.integer(n)
  cumsum(c(1, diff(n) > 0))
}

# method 2
equal_parts <- function(x, np = 5) {
  n <- length(x)
  s <- ceiling(n/np)
  rl <- rle(s:(n+s-1) %/% (n/np))$lengths
  rep(seq_along(rl), rl)
}

DF.long <- stack(setNames(strsplit(DF$Text, ' '), DF$ID))

DF.long$grp <- with(DF.long, ave(values, ind, FUN =  equal_parts))
DF.agg <- aggregate(values ~ ind + grp, DF.long, paste0, collapse = ' ')

reshape(DF.agg, idvar = 'ind', timevar = 'grp', direction = 'wide')

which gives:

  ind  values.1        values.2      values.3        values.4             values.5
1   1   This is     a very long      piece of    string. This contains many lines.
2   2 This is a very long piece of string. It contains one or      two more words.
3   3     Short            text          <NA>            <NA>                 <NA>
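In the outputs above, short rows such as "Short text" occupy the first columns instead of being scattered across the bins. A small sketch of why, using a copy of the method 1 grouping function under the hypothetical name equal_parts_cut: cumsum() over the bin changes relabels the occupied bins as 1, 2, 3, and so on.

```r
# Copy of method 1, renamed for this illustration only.
equal_parts_cut <- function(x, np = 5) {
  n <- as.integer(cut(seq_along(x), np))   # raw bin codes, may skip numbers
  cumsum(c(1, diff(n) > 0))                # consecutive group ids
}

# Two words fall into bins 1 and 5, but are relabelled as groups 1 and 2.
as.integer(cut(1:2, 5))
# 1 5
equal_parts_cut(c("Short", "text"))
# 1 2

# With enough words every bin is used, so the labels stay 1 to 5.
equal_parts_cut(letters[1:12])
# 1 1 1 2 2 3 3 4 4 5 5 5
```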

3) 'tidyverse':

library(dplyr)
library(tidyr)
separate_rows(DF, Text) %>% 
  group_by(ID) %>% 
  mutate(grp = equal_parts(Text)) %>%     # using the 'equal_parts'-function from the base R solution
  group_by(grp, .add = TRUE) %>%          # '.add' replaces the deprecated 'add' argument (dplyr >= 1.0.0)
  summarise(Text = paste0(Text, collapse = ' ')) %>% 
  spread(grp, Text)

which gives:

# A tibble: 3 x 6
# Groups:   ID [3]
     ID       `1`             `2`           `3`             `4`                  `5`
* <int>     <chr>           <chr>         <chr>           <chr>                <chr>
1     1   This is     a very long      piece of    string. This contains many lines.
2     2 This is a very long piece of string. It contains one or      two more words.
3     3     Short            text          <NA>            <NA>                 <NA>

Used data:

DF <- structure(list(ID = 1:3, Text = c("This is a very long piece of string. This contains many lines.", 
                                        "This is a very long piece of string. It contains one or two more words.", 
                                        "Short text")),
                .Names = c("ID", "Text"), row.names = c(NA, -3L), class = "data.frame")

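Answer 2: (score: 1)

The OP supplied a data frame with only one row, so it is unclear what the expected result is when Text has several rows with different numbers of words. Is it required that

  1. the resulting columns contain the same number of words in every row (where enough words are available), or
  2. each row is split separately?

    Solution for case 1

    If the requirement is that each column should contain the same number of words in all rows (where enough words are available), then the row with the most words determines the distribution. For rows with fewer words, the columns are filled from the left (left-aligned).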

    library(data.table)
    n_brks <- 5L
    setDT(DT)[, strsplit(Text, "\\s"), by = ID][
      , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
        , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
    
       ID      Text1           Text2           Text3                Text4           Text5
    1:  1  This is a very long piece of string. This contains many lines.                
    2:  2  This is a very long piece   of string. It      contains one or two more words.
    3:  3 Short text                                                                     
    4:  4    Shorter
    

    Text1 to Text4 contain the same number of words (3 each) in rows 1 and 2. Rows with too few words to fill every column are filled from the left.
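The idea behind case 1 can also be sketched in base R, without data.table. This is only an illustration under the assumption that the same four texts are used; the names texts, pos, bins and wide are made up here. The key point is that cut() is applied once to the pooled word positions, so all rows share the same breaks:

```r
texts <- c("This is a very long piece of string. This contains many lines.",
           "This is a very long piece of string. It contains one or two more words.",
           "Short text",
           "Shorter")

words <- strsplit(texts, "\\s")
id    <- rep(seq_along(words), lengths(words))   # row id of every word
pos   <- sequence(lengths(words))                # word position within its row
bins  <- cut(pos, 5, labels = FALSE)             # common breaks over positions 1..15

# Rows x bins matrix of pasted chunks; empty cells are NA.
wide <- tapply(unlist(words), list(id, bins), paste, collapse = " ")
wide
```

Since the longest row has 15 words, every bin covers three positions, and the 12-word row simply leaves the fifth bin empty.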

    Data

    library(data.table)
    
    DT <- fread(
      'ID   Text
       1    "This is a very long piece of string. This contains many lines."
       2    "This is a very long piece of string. It contains one or two more words."
       3    "Short text"
       4     "Shorter"')
    

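    Explanation

    After coercion to data.table, the text in each row is split at word boundaries and returned in long format (which may be considered equivalent to a time series):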

    n_brks <- 5L
    setDT(DT)[, strsplit(Text, "\\s"), by = ID]
    
        ID       V1
     1:  1     This
     2:  1       is
     3:  1        a
     4:  1     very
     5:  1     long
     6:  1    piece
     7:  1       of
     8:  1  string.
     9:  1     This
    10:  1 contains
    11:  1     many
    12:  1   lines.
    13:  2     This
    14:  2       is
    15:  2        a
    16:  2     very
    17:  2     long
    18:  2    piece
    19:  2       of
    20:  2  string.
    21:  2       It
    22:  2 contains
    23:  2      one
    24:  2       or
    25:  2      two
    26:  2     more
    27:  2   words.
    28:  3    Short
    29:  3     text
    30:  4  Shorter
        ID       V1
    

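    The words are then pasted together again using a computed grouping variable, which is created by applying cut() to the row numbers from rowid() to form n_brks chunks: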

    setDT(DT)[, strsplit(Text, "\\s"), by = ID][
      , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]
    
        ID         cut                   V1
     1:  1 (0.986,3.8]            This is a
     2:  1   (3.8,6.6]      very long piece
     3:  1   (6.6,9.4]      of string. This
     4:  1  (9.4,12.2] contains many lines.
     5:  2 (0.986,3.8]            This is a
     6:  2   (3.8,6.6]      very long piece
     7:  2   (6.6,9.4]        of string. It
     8:  2  (9.4,12.2]      contains one or
     9:  2   (12.2,15]      two more words.
    10:  3 (0.986,3.8]           Short text
    11:  4 (0.986,3.8]              Shorter
    

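    Finally, this result is reshaped from long to wide format again to create the expected result. The column headers are created by the rowid() function, and missing values are replaced by "":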

    setDT(DT)[, strsplit(Text, "\\s"), by = ID][
      , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
        , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
    

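    Solution for case 2

    If the requirement is that each row should be split separately with its words distributed evenly, then the number of words in each column will vary from row to row. Rows with fewer words than columns get at most one word per column.

    The solution for this case is a modification of Jaap's suggestion: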

    library(data.table)
    n_brks <- 5L
    setDT(DT)[, strsplit(Text, "\\s"), by = ID][
      , ri := cut(seq_len(.N), n_brks), by = ID][
        , paste(V1, collapse = " "), by = .(ID, ri)][
          , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
    
       ID     Text1           Text2         Text3           Text4                Text5
    1:  1 This is a       very long      piece of    string. This contains many lines.
    2:  2 This is a very long piece of string. It contains one or      two more words.
    3:  3     Short            text                                                   
    4:  4   Shorter
    

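    Now the number of words in each column varies by row. For example, columns Text2 to Text4 each have 2 words in row 1 and 3 words in row 2. The 2 words of row 3 are placed in separate columns.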
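The per-row word counts described above can be checked with a small base R sketch. The helper chunk_sizes is hypothetical and only counts how many word positions cut() places in each of the n_brks bins for a row of a given length:

```r
# Number of words per bin when a row with n_words words is cut on its own.
chunk_sizes <- function(n_words, n_brks = 5L) {
  tabulate(cut(seq_len(n_words), n_brks, labels = FALSE), nbins = n_brks)
}

chunk_sizes(12)   # row 1
# 3 2 2 2 3
chunk_sizes(15)   # row 2
# 3 3 3 3 3
chunk_sizes(2)    # row 3: only the first and last bins are occupied
# 1 0 0 0 1
```

For row 3 the two occupied bins are not adjacent, but the dcast() call numbers the chunks with rowid(), which is why the two words still appear in Text1 and Text2.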