I am hoping for some help with this tricky string problem.
Current data frame:
ID Text
1 This is a very long piece of string. This contains many lines.
I would like to convert it to:
ID Text1 Text2 Text3 Text4 Text5
1 This is a very long piece of string. This contains many lines.
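The string should be split evenly on word boundaries. In the example above, I am trying to split the line evenly into 5 parts, so each column should contain roughly 20% of the words.
The goal behind this is to put the words into frames so that, once the conversation has been split up, they can be viewed as time-series data.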
Answer 0 (score: 4)
There may be better options, but this one requires no additional packages.
First, we create a reproducible example:
df <- data.frame(ID=1:2,
Text=c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"Lorem ipsum dolor sit amet, consectetur adipiscing elit"),
stringsAsFactors = FALSE)
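Then chunkize, a wrapper around split + cut, is the tricky part. It takes a character string, splits it on spaces, groups the words into n chunks, and returns a data.frame with n columns. (We remove the names so that the rbind further down works properly.)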
chunkize <- function(chr, n=5){
x <- strsplit(chr, " ")[[1]]
df <- as.data.frame(
lapply(
split(x,
cut(seq_along(x),
breaks=n)),
paste, collapse=" "),
stringsAsFactors = FALSE, col.names=NULL)
names(df) <- NULL
df
}
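To see what the split + cut step does in isolation, here is a minimal sketch on the sentence from the question. The variable names `words`, `bins` and `chunks` are only illustrative and are not part of the answer's code.

```r
# Split the sentence from the question into words
words <- strsplit("This is a very long piece of string. This contains many lines.", " ")[[1]]

# cut() assigns each word position to one of 5 equal-width intervals
bins <- cut(seq_along(words), breaks = 5)
table(bins)   # number of words that fall into each interval

# split() groups the words by interval; paste() glues each group back together
chunks <- vapply(split(words, bins), paste, character(1), collapse = " ")
unname(chunks)
```

Because the intervals have equal width over the positions 1 to 12, the chunks are only approximately equal in word count when the number of words is not a multiple of 5.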
Then we simply apply it to each row. We also add the ID column:
df_chunked <- do.call("rbind",
apply(df, 1,
function(x) cbind(x[1], chunkize(x[-1], 5))))
Finally, we rename the columns:
colnames(df_chunked) <- c("ID", paste0("Text", 1:5))
The same thing wrapped in a convenience function:
chunkize_this <- function(df, n=5){
chunkize <- function(chr, n){
x <- strsplit(chr, " ")[[1]]
df <- as.data.frame(
lapply(
split(x,
cut(seq_along(x),
breaks=n)),
paste, collapse=" "),
stringsAsFactors = FALSE, col.names=NULL)
names(df) <- NULL
df
}
df_chunked <- do.call("rbind",
apply(df, 1, function(x) cbind(x[1], chunkize(x[-1], n))))
colnames(df_chunked) <- c(colnames(df)[1], paste0("Text", 1:n))
rownames(df_chunked) <- NULL
df_chunked
}
You can try it with:
View(chunkize_this(df, 3))
View(chunkize_this(df, 5))
Another example:
df <- read.table(h=T, text=
'ID Text
1 "This is a very long piece of string. This contains many lines."
2 "This is a very long piece of string. It contains one or two more word."
3 "Short"'
)
> chunkize_this(df, 5)
ID Text1 Text2 Text3 Text4 Text5
1 1 This is a very long piece of string. This contains many lines.
2 2 This is a very long piece of string. It contains one or two more word.
3 3 Short
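One thing worth checking is what happens when a row has fewer words than chunks, as with "Short" above. The sketch below applies the same split + cut idea outside of chunkize (the names `x`, `bins` and `pieces` are illustrative). Because split keeps the empty intervals, the result still has n pieces, and the single word does not necessarily land in the first one.

```r
x <- "Short"

# With a single position, cut() still builds 5 intervals around it
bins <- cut(seq_along(x), breaks = 5)

# Empty intervals are kept by split() and become empty strings after paste()
pieces <- vapply(split(x, bins), paste, character(1), collapse = " ")
unname(pieces)
which(pieces == "Short")   # position of the interval that received the word
```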
Answer 1 (score: 3)
Alternative approaches implemented in data.table, base R and the tidyverse. The number of parts can be hard-coded or assigned up front:
# pre-allocating number of parts
np <- 5
The different options:
1) Using 'data.table':
library(data.table)
# method 1
setDT(DF)[, strsplit(Text, "\\s"), by = ID
][, grp := rleid(cut(1:.N, np)), by = ID
][, paste(V1, collapse = " "), by = .(ID, grp)
][, dcast(.SD, ID ~ paste0('Text', grp), value.var = "V1")]
# method 2
setDT(DF)[, strsplit(Text, ' '), by = ID
][, grp := {s <- ceiling(.N/np); rleid(s:(.N+s-1) %/% (.N/np))}, by = ID
][, paste(V1, collapse = ' '), by = .(ID, grp)
][, dcast(.SD, ID ~ paste0('Text', grp), value.var = 'V1')]
Both give:
   ID Text1 Text2 Text3 Text4 Text5
1: 1 This is a very long piece of string. This contains many lines.
2: 2 This is a very long piece of string. It contains one or two more words.
3: 3 Short text NA NA NA
2) Base R:
# method 1
equal_parts <- function(x, np = 5) {
n <- cut(seq_along(x), np)
n <- as.integer(n)
cumsum(c(1, diff(n) > 0))
}
# method 2
equal_parts <- function(x, np = 5) {
n <- length(x)
s <- ceiling(n/np)
rl <- rle(s:(n+s-1) %/% (n/np))$lengths
rep(seq_along(rl), rl)
}
DF.long <- stack(setNames(strsplit(DF$Text, ' '), DF$ID))
DF.long$grp <- with(DF.long, ave(values, ind, FUN = equal_parts))
DF.agg <- aggregate(values ~ ind + grp, DF.long, paste0, collapse = ' ')
reshape(DF.agg, idvar = 'ind', timevar = 'grp', direction = 'wide')
which gives:
  ind values.1 values.2 values.3 values.4 values.5
1 1 This is a very long piece of string. This contains many lines.
2 2 This is a very long piece of string. It contains one or two more words.
3 3 Short text <NA> <NA> <NA>
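To make the grouping step easier to follow, this sketch calls the first equal_parts variant on two made-up inputs. The function body is the one from method 1, written slightly more compactly; the inputs `letters[1:12]` and `c("Short", "text")` are only stand-ins for a 12-word and a 2-word row.

```r
equal_parts <- function(x, np = 5) {
  # interval number (1..np) for each position
  n <- as.integer(cut(seq_along(x), np))
  # renumber consecutively so that unused intervals leave no gaps
  cumsum(c(1, diff(n) > 0))
}

equal_parts(letters[1:12])       # 1 1 1 2 2 3 3 4 4 5 5 5
equal_parts(c("Short", "text"))  # 1 2
```

The renumbering is what makes a short row fill the columns from the left: the two words fall into the first and last interval, but end up in groups 1 and 2.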
3) 'tidyverse':
library(dplyr)
library(tidyr)
separate_rows(DF, Text) %>%
group_by(ID) %>%
mutate(grp = equal_parts(Text)) %>% # using the 'equal_parts'-function from the base R solution
group_by(grp, .add = TRUE) %>% # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0)
summarise(Text = paste0(Text, collapse = ' ')) %>%
spread(grp, Text)
which gives:
# A tibble: 3 x 6
# Groups: ID [3]
ID `1` `2` `3` `4` `5`
* <int> <chr> <chr> <chr> <chr> <chr>
1 1 This is a very long piece of string. This contains many lines.
2 2 This is a very long piece of string. It contains one or two more words.
3 3 Short text <NA> <NA> <NA>
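In recent tidyr releases, spread() is superseded by pivot_wider(). Assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed, the same pipeline might look like the sketch below. It repeats the data and the first equal_parts variant so that it runs on its own, splits on a single space rather than relying on the default separator of separate_rows(), and uses the name `wide` purely for illustration.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  ID = 1:3,
  Text = c("This is a very long piece of string. This contains many lines.",
           "This is a very long piece of string. It contains one or two more words.",
           "Short text"),
  stringsAsFactors = FALSE
)

# Same grouping helper as in the base R solution (method 1)
equal_parts <- function(x, np = 5) {
  n <- as.integer(cut(seq_along(x), np))
  cumsum(c(1, diff(n) > 0))
}

wide <- DF %>%
  separate_rows(Text, sep = " ") %>%            # one word per row
  group_by(ID) %>%
  mutate(grp = equal_parts(Text)) %>%           # chunk number within each ID
  group_by(grp, .add = TRUE) %>%
  summarise(Text = paste(Text, collapse = " "), .groups = "drop") %>%
  pivot_wider(names_from = grp, values_from = Text, names_prefix = "Text")

wide
```

Here names_prefix gives the columns the Text1 to Text5 names asked for in the question, instead of the bare numbers produced by spread().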
Data used:
DF <- structure(list(ID = 1:3, Text = c("This is a very long piece of string. This contains many lines.",
"This is a very long piece of string. It contains one or two more words.",
"Short text")),
.Names = c("ID", "Text"), row.names = c(NA, -3L), class = "data.frame")
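Answer 2 (score: 1)
The data frame provided by the OP has only one row. It is therefore unclear what the expected result is when Text holds several rows with differing numbers of words. There are two possible readings.
If the requirement is that each column should contain the same number of words across all rows (provided enough words are available), the row with the most words determines the distribution. Rows with fewer words fill the columns from the left (left-aligned).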
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID      Text1           Text2           Text3                Text4           Text5
1:  1  This is a very long piece of string. This contains many lines.
2:  2  This is a very long piece   of string. It      contains one or two more words.
3:  3 Short text
4:  4    Shorter
Columns Text1 to Text4 contain the same number of words (3 each) for rows 1 and 2. Rows with fewer words than columns are filled from the left.
library(data.table)
DT <- fread(
'ID Text
1 "This is a very long piece of string. This contains many lines."
2 "This is a very long piece of string. It contains one or two more words."
3 "Short text"
4 "Shorter"')
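After coercion to data.table, the text in each row is split at word boundaries and returned in long format (which may already be regarded as the equivalent of a time series):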
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID]
    ID       V1
 1:  1     This
 2:  1       is
 3:  1        a
 4:  1     very
 5:  1     long
 6:  1    piece
 7:  1       of
 8:  1  string.
 9:  1     This
10:  1 contains
11:  1     many
12:  1   lines.
13:  2     This
14:  2       is
15:  2        a
16:  2     very
17:  2     long
18:  2    piece
19:  2       of
20:  2  string.
21:  2       It
22:  2 contains
23:  2      one
24:  2       or
25:  2      two
26:  2     more
27:  2   words.
28:  3    Short
29:  3     text
30:  4  Shorter
    ID       V1
Then the words are pasted together again using a computed grouping variable, which applies the cut() function to the rowid() numbering to create n_brks chunks:
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]
    ID         cut                   V1
 1:  1 (0.986,3.8]            This is a
 2:  1   (3.8,6.6]      very long piece
 3:  1   (6.6,9.4]      of string. This
 4:  1  (9.4,12.2] contains many lines.
 5:  2 (0.986,3.8]            This is a
 6:  2   (3.8,6.6]      very long piece
 7:  2   (6.6,9.4]        of string. It
 8:  2  (9.4,12.2]      contains one or
 9:  2   (12.2,15]      two more words.
10:  3 (0.986,3.8]           Short text
11:  4 (0.986,3.8]              Shorter
Finally, this result is reshaped from long to wide format to create the expected result. The column headers are created by the rowid() function, and missing values are replaced by "":
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
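If the requirement is that each row should be split up separately with its words distributed evenly, the number of words per column will vary from row to row. Rows with fewer words than columns get at most one word per column.
The solution for this case is a modification of Jaap's suggestion: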
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, ri := cut(seq_len(.N), n_brks), by = ID][
, paste(V1, collapse = " "), by = .(ID, ri)][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID     Text1           Text2         Text3           Text4                Text5
1:  1 This is a       very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text
4:  4   Shorter
Now the number of words in each column varies by row. For example, columns Text2 to Text4 each have 2 words in row 1 and 3 words in row 2. The 2 words of row 3 are placed in different columns.