Splitting a string into variable-length substrings

Asked: 2021-01-28 21:39:15

Tags: r regex

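I have a string of length 410 (i.e. 410 characters). I want to split it into substrings according to these rules:

  1. Each substring must be shorter than 260 characters.
  2. Each substring must end on a complete word. For example, given "this is a test string", a substring such as "this is a test st" is not allowed; it should be "this is a test".
  3. No characters may be dropped, so the substrings, when joined back together, must read the same as the original string.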
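
As a rough illustration, the three rules can be written as a check on any candidate split. This is only a sketch: `check_split` is a hypothetical helper name, and it assumes the pieces are rejoined with a single space. Under that assumption, a lossless rejoin also implies that every piece ends on a whole word.

```r
# Hypothetical helper: tests a candidate split against the rules above.
check_split <- function(pieces, original, limit = 260) {
  c(
    # Rule 1: every piece is strictly shorter than the limit.
    short_enough = all(nchar(pieces) < limit),
    # Rules 2 and 3: rejoining with single spaces must reproduce the original,
    # which only happens when every cut falls on a space between two words.
    lossless = identical(paste(pieces, collapse = " "), original)
  )
}

x <- "this is a test string"
check_split(c("this is a test", "string"), x, limit = 15)   # both checks TRUE
check_split(c("this is a test st", "ring"), x, limit = 20)  # lossless is FALSE
```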

Reproducible data:

ex_str = "This has an advantage of avoiding name conflicts i.e. what if you have a function named `DataFrame()` in your global environment. Using `pandas.DataFrame()` ensures that right function is called. To build on it further, python also provides an option of importing a function with your name of choice i.e. `import pandas as pd`. Now to call out `pandas` internal functions you can use `pd` like `pd.DataFrame()`"
nchar(ex_str)
#> [1] 410

Created on 2021-01-29 by the reprex package (v0.3.0)

Expected output:

s1 = "This has an advantage of avoiding name conflicts i.e. what if you have a function named `DataFrame()` in your global environment. Using `pandas.DataFrame()` ensures that right function is called."
s2 = "To build on it further, python also provides an option of importing a function with your name of choice i.e. `import pandas as pd`. Now to call out `pandas` internal functions you can use `pd` like `pd.DataFrame()`"
nchar(s1) #nchar() should be less than 260
#> [1] 195
nchar(s2)
#> [1] 214

Created on 2021-01-29 by the reprex package (v0.3.0)

This problem seems too hard for me to even get started on; any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 1)

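# Split into words; each piece is rebuilt by joining whole words with single spaces.
spl <- strsplit(ex_str, " ")[[1]]
out <- character(0)
while (length(spl) > 0) {
  # Joining the first k words gives cumsum(nchar)[k] + (k - 1) characters, so adding k
  # (not k - 1) flags the first word that would bring a piece to 260 characters or more.
  ind <- which((cumsum(nchar(spl)) + seq_along(spl)) > 260)[1]
  # No such word: everything that is left fits into one piece.
  if (is.na(ind)) ind <- length(spl) + 1L
  if (ind == 1L) {
    warning("first word is too long, adding anyway", call. = FALSE)
    out <- c(out, spl[1])
    spl <- spl[-1]
  } else {
    out <- c(out, paste(spl[seq_len(ind - 1)], collapse = " "))
    spl <- spl[-seq_len(ind - 1)]
  }
}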

nchar(out)
# [1] 253 156

out
# [1] "This has an advantage of avoiding name conflicts i.e. what if you have a function named `DataFrame()` in your global environment. Using `pandas.DataFrame()` ensures that right function is called. To build on it further, python also provides an option of"
# [2] "importing a function with your name of choice i.e. `import pandas as pd`. Now to call out `pandas` internal functions you can use `pd` like `pd.DataFrame()`"
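
The split above breaks at word boundaries, so the first piece stops mid-sentence instead of matching s1 and s2 from the question. A possible variant is to cut the text into sentences first and then pack whole sentences greedily. This is a minimal sketch, not part of the original answer: `pack_pieces` is a hypothetical helper, and the regex assumes a sentence ends in ".", "?" or "!" followed by whitespace and an uppercase letter, which happens to hold for the sample text but will misfire on many abbreviations.

```r
ex_str <- "This has an advantage of avoiding name conflicts i.e. what if you have a function named `DataFrame()` in your global environment. Using `pandas.DataFrame()` ensures that right function is called. To build on it further, python also provides an option of importing a function with your name of choice i.e. `import pandas as pd`. Now to call out `pandas` internal functions you can use `pd` like `pd.DataFrame()`"

# Hypothetical helper: greedily join consecutive pieces with single spaces
# while the joined text stays strictly shorter than `limit`.
pack_pieces <- function(pieces, limit = 260) {
  out <- character(0)
  current <- ""
  for (p in pieces) {
    candidate <- if (nzchar(current)) paste(current, p) else p
    if (nchar(candidate) < limit) {
      current <- candidate
    } else {
      if (nzchar(current)) out <- c(out, current)
      current <- p  # a single overlong piece is kept whole rather than cut
    }
  }
  if (nzchar(current)) out <- c(out, current)
  out
}

# Assumed sentence boundary: whitespace preceded by . ? or ! and followed by a capital.
sentences <- strsplit(ex_str, "(?<=[.?!])\\s+(?=[A-Z])", perl = TRUE)[[1]]
chunks <- pack_pieces(sentences, limit = 260)

nchar(chunks)
# [1] 195 214
```

With the sample string this gives two pieces of 195 and 214 characters, the same as s1 and s2 in the question. A single sentence of 260 characters or more would still be returned whole, so a word-level split like the loop above remains the fallback for such input.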