Question

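I have a data frame in which one column is a character vector, and each element of that vector is the full text of a document. I want to truncate the words in each element so that no word is longer than 5 characters.

For example: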

a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)

head(df)
  file                                                      text
1    1     Words longer than five characters should be truncated
2    2 Words shorter than five characters should not be modified

This is what I want:

  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

I tried using strsplit() and strtrim() to modify each word (based in part on "split vectors of words by every n words (vectors are in a list)"):

x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "Words" "short" "than" 
[12] "five"  "chara" "shoul" "not"   "be"    "modif"

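But I don't know whether this is the right direction, because unlist() flattens everything into a single vector, and ultimately I need the words in the data frame, associated with the correct rows, as shown above.

Is there a way to do this using gsub and a regex?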

Answer 1

If you want to use gsub for this task:

> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=TRUE)
> df
#   file                                           text
# 1    1     Words longe than five chara shoul be trunc
# 2    2 Words short than five chara shoul not be modif
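
The answer gives the pattern without explanation, so here is a hedged breakdown of how it appears to work, run on a plain character vector instead of the data frame. The variable names (text, truncated, alt) and the alternative pattern at the end are assumptions of this sketch, not part of the original answer.

```r
text <- c("Words longer than five characters should be truncated",
          "Words shorter than five characters should not be modified")

# (?=\\b\\pL{6,})  lookahead: we are at the start of a word with 6+ letters
# .{5}             consume the first five characters of that word
# \\K              drop everything consumed so far from the reported match
# \\pL*            match the remaining letters, which are replaced by ""
truncated <- gsub("(?=\\b\\pL{6,}).{5}\\K\\pL*", "", text, perl = TRUE)
print(truncated)

# Alternative spelling (an assumption of this sketch): capture the first
# five letters and put them back, dropping the rest of the word.
alt <- gsub("\\b(\\pL{5})\\pL+", "\\1", text, perl = TRUE)
print(identical(alt, truncated))
```

Both patterns only touch runs of letters (\pL), so digits and punctuation are left alone; whether that is desirable depends on the documents.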

Answer 2

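You're on the right track. However, for your idea to work, you have to do the split/trim/paste steps separately for each row. Here is one way to do it. I've been deliberately verbose to make it clear, but you can obviously use fewer lines.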

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})

Output:

> df
  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

The short version is

df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")  
})
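
One detail worth knowing: sapply() over a character vector names the result after the input strings by default, so the new column carries those names around. A minimal sketch of a variant that avoids this, assuming a hypothetical helper called truncate_words (not from the answer), could look like this:

```r
# truncate_words is a hypothetical helper name used only for this sketch.
truncate_words <- function(text, max_len = 5) {
  # strsplit() is vectorised: it returns one vector of words per element.
  # Splitting on "\\s+" also collapses runs of whitespace into one space.
  vapply(strsplit(text, "\\s+"), function(words) {
    paste(strtrim(words, max_len), collapse = " ")
  }, character(1))
}

df <- data.frame(
  file = c(1, 2),
  text = c("Words longer than five characters should be truncated",
           "Words shorter than five characters should not be modified"),
  stringsAsFactors = FALSE
)
df$text <- truncate_words(df$text)
print(df)
```

vapply() also checks that each result is a single string, which makes mistakes surface earlier than with sapply().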

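Edit

I just realized you asked whether this can be done with gsub and a regex. Even though you don't need them here, it is still possible, just harder to read: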

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})

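The regex matches everything that comes after the first 5 characters of each word and replaces it with nothing. perl = TRUE is required to enable the regex lookbehind ((?<=.{5})).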
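
To see why this pattern still needs the split step, here is a small sketch comparing it on individual words and on an unsplit string. The sample inputs and variable names are assumptions made for illustration.

```r
pattern <- "(?<=.{5}).+"

# Applied per word: only the part after the fifth character is removed.
words <- c("Words", "longer", "than", "characters")
per_word <- gsub(pattern, "", words, perl = TRUE)
print(per_word)

# Applied to a whole sentence: "." also matches spaces, so everything
# after the first five characters of the string is removed.
whole <- gsub(pattern, "", "Words longer than five", perl = TRUE)
print(whole)
```

Because the pattern has no notion of word boundaries, it truncates whatever string it is given, which is why it is applied to each word after strsplit().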

Truncate words within each element of a character vector in R
