Question

I have a character vector like the one shown below.

text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))"
)

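If the total number of characters in the whole vector (as shown above) is greater than 100, I want to split it into multiple character vectors, each with fewer than 100 characters. I tried a quantile-based approach, but it does not work: the first 3 elements of the vector contain far less text than elements 5 through 11, so equal-count cuts are unreliable and error-prone.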

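# Number of chunks needed if each one holds about 100 characters
nRun <- ceiling(sum(nchar(text), na.rm = TRUE) / 100)
# Cut positions at equally spaced quantiles of the element indices
cutsIter <- ceiling(quantile(seq_along(text), probs = seq.int(0, 1, 1 / nRun)))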

New character vector:

text[cutsIter[1]:cutsIter[2]]

Expected result: the first 5 elements should be in one vector, the 6th and 7th should be together in the next vector, and so on.
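For reference, the expected breaks can be read off the cumulative character counts. The sketch below is only an illustrative check and not part of the original attempt; it assumes the chunks are taken as `text[cutsIter[i]:cutsIter[i + 1]]`, following the pattern shown above.

```r
text <- c(
  "My test", "Test2", "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))"
)

len <- nchar(text)
len          # 7 5 5 26 22 53 33 47 53 33 47
cumsum(len)  # 7 12 17 43 65 118 151 198 251 284 331

# The running total first passes 100 at element 6, so the first
# vector can hold elements 1 to 5.
first_over <- which(cumsum(len) > 100)[1]   # 6

# The quantile cuts fall at indices 1, 4, 6, 9, 11; the chunk 6:9
# alone already holds far more than 100 characters.
chunk_total <- sum(len[6:9])                # 186
```

The short strings at the start and the long ones at the end are why cutting by element position does not respect a character budget.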

Answer 1

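Here is one way to do it. I am sure there are better ways, and this solution can be improved, but I chose to write a custom recursive function. One issue remains: when the only element left has an nchar of 100, nothing satisfies cumsum(x) < 100, so the index lookup fails. Fix that to your liking.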

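# Group sizes are accumulated in this global vector by fn()
out <- c()

fn <- function(x) {
  if (max(cumsum(x)) < 100) {
    # Everything that is left fits into one last group
    return(c(out, length(x)))
  }
  # Index of the last element that keeps the running total below 100
  ind <- max(which(cumsum(x) < 100))
  out <<- c(out, ind)

  # Drop the elements just grouped and recurse on the remainder
  fn(x[-seq_len(ind)])
}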

# The result of the function is the indices for us to split the vector
tmp <- fn(nchar(text))
tmp
[1] 5 2 1 2 1

If we apply this to our vector text:

split(text, rep(seq_len(length(tmp)), tmp))

$`1`
[1] "My test"                    "Test2"                      "Tests"                     
[4] "Dolphin Sentimental S.r.l." "Tiger Sentiyapa S.r.l."    

$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"

$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"

Finally, if you want to create one separate vector per group:

split(text, rep(seq_len(length(tmp)), tmp)) |>
  setNames(paste0("vec", seq_along(tmp))) |>
  list2env(envir = globalenv())
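The edge case noted in this answer (an element whose nchar alone reaches the limit) could be avoided with an iterative version that keeps no global state. This is a minimal sketch rather than the answer's own method: `group_by_budget()` and its `limit` and `strict` arguments are hypothetical names introduced here, and it assumes `lengths` contains no NA values.

```r
# Hypothetical helper (not from the answer): greedy grouping of lengths.
# strict = TRUE keeps each group's total below `limit`;
# strict = FALSE allows totals equal to `limit`.
group_by_budget <- function(lengths, limit = 100, strict = TRUE) {
  groups  <- integer(length(lengths))
  current <- 1L   # id of the group being filled
  running <- 0    # characters already in the current group
  for (i in seq_along(lengths)) {
    total <- running + lengths[i]
    over  <- if (strict) total >= limit else total > limit
    # Open a new group only if the current one already holds something,
    # so an oversized element ends up alone instead of raising an error.
    if (over && running > 0) {
      current <- current + 1L
      running <- 0
    }
    groups[i] <- current
    running   <- running + lengths[i]
  }
  groups
}

text <- c(
  "My test", "Test2", "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))"
)

ids <- group_by_budget(nchar(text))
ids                              # 1 1 1 1 1 2 2 3 4 4 5
res <- split(text, ids)          # same grouping as the recursive fn()
edge <- group_by_budget(c(100L, 5L))   # 1 2: the long element sits alone
```

Because the function returns one group id per element, the result can be passed straight to split() without the rep() step.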

Answer 2

There is a handy predefined function, MESS::cumsumbinning(), that you can use in scenarios like this.

text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))", 
  "Effort rate calculates to grant (Debt to Income Rate)", 
  "Amount of pensions received mens.", 
  "(Grant data) (Pension Received (Monthly Basis))"
)

library(MESS)

split(text, cumsumbinning(nchar(text), 100))
#> $`1`
#> [1] "My test"                    "Test2"                     
#> [3] "Tests"                      "Dolphin Sentimental S.r.l."
#> [5] "Tiger Sentiyapa S.r.l."    
#> 
#> $`2`
#> [1] "Effort rate calculates to grant (Debt to Income Rate)"
#> [2] "Amount of pensions received mens."                    
#> 
#> $`3`
#> [1] "(Grant data) (Pension Received (Monthly Basis))"      
#> [2] "Effort rate calculates to grant (Debt to Income Rate)"
#> 
#> $`4`
#> [1] "Amount of pensions received mens."              
#> [2] "(Grant data) (Pension Received (Monthly Basis))"

Needless to say, if you want to save each item of the above list as a separate object, use list2env as follows:

split(text, cumsumbinning(nchar(text), 100)) |>
  list2env(envir = .GlobalEnv)
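One thing to keep in mind is that the list names here are "1", "2", and so on, which are not syntactic R names, so the exported objects would need backticks to access. The sketch below shows one possible workaround: prefix the names and export into a dedicated environment. The `grp` prefix and the `env` object are arbitrary choices made for illustration, and the group ids are typed in by hand from the threshold-100 output above so that the example runs without MESS installed.

```r
text <- c(
  "My test", "Test2", "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))"
)

# Group ids copied by hand from the threshold-100 result shown above
bins   <- c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4)
groups <- split(text, bins)

# Turn "1", "2", ... into syntactic names: "grp1", "grp2", ...
names(groups) <- paste0("grp", names(groups))

# Export into a fresh environment rather than the global one
env <- list2env(groups, envir = new.env())
ls(env)     # "grp1" "grp2" "grp3" "grp4"
env$grp3    # the two strings of the third group
```

Using a separate environment keeps the generated objects from overwriting anything in the workspace; passing `envir = .GlobalEnv` instead would behave like the call above.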

If you want each group to stay strictly below the limit (fewer than 100 characters), use a threshold of 99 instead:

split(text, cumsumbinning(nchar(text), 99))

$`1`
[1] "My test"                   
[2] "Test2"                     
[3] "Tests"                     
[4] "Dolphin Sentimental S.r.l."
[5] "Tiger Sentiyapa S.r.l."    

$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"

$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    

$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"
