Split a string into new columns every n characters

Date: 2018-08-05 09:54:14

Tags: r substring gsub stringr

Suppose I have a data frame with a character vector, var2:

var1  var2
1     abcdefghi
2     abcdefghijklmnop
3     abc
4     abcdefghijklmnopqrst

What is the most efficient way to split var2 every n characters into new columns, up to the end of each string?

For example, with n = 4 the output would look like this:

var1  var2                  new_var1  new_var2 new_var3  new_var4  new_var5
1     abcdefghi             abcd      efgh     i
2     abcdefghijklmnop      abcd      efgh     ijkl      mnop
3     abc                   abc
4     abcdefghijklmnopqrst  abcd      efgh     ijkl      mnop      qrst

Should I use the stringr package with str_split_fixed, or a regular expression?

The solution needs to create as many new_var_n columns as the length of var2 requires; a string could be 10,000 characters long, for example.

5 answers:

Answer 0 (score: 4)

Here is an option using data.table and a helper function, fixed_split, which I took from this answer and modified slightly (it uses tstrsplit instead of strsplit):

library(data.table)
fixed_split <- function(text, n) {
  data.table::tstrsplit(text, paste0("(?<=.{",n,"})"), perl=TRUE)
}

First define n (the number of characters per piece) and new_vars (the number of columns to add):

n <- 4
new_vars <- ceiling(max(nchar(df$var2)) / n)

setDT(df)[, paste0("new_var", seq_len(new_vars)) := fixed_split(var2, n = n)][]
#   var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
#1:    1            abcdefghi     abcd     efgh        i     <NA>     <NA>
#2:    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop     <NA>
#3:    3                  abc      abc     <NA>     <NA>     <NA>     <NA>
#4:    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst

Answer 1 (score: 3)

Alternatively, you can try read.fwf in base R. No special packages are needed:

tmp <- read.fwf(
    textConnection(dtf$var2),
    widths = rep(4, ceiling(max(nchar(dtf$var2) / 4))),
    stringsAsFactors = FALSE)

cbind(dtf, tmp)

#   var1                 var2   V1   V2   V3   V4   V5
# 1    1            abcdefghi abcd efgh    i <NA> <NA>
# 2    2     abcdefghijklmnop abcd efgh ijkl mnop <NA>
# 3    3                  abc  abc <NA> <NA> <NA> <NA>
# 4    4 abcdefghijklmnopqrst abcd efgh ijkl mnop qrst

Answer 2 (score: 2)

Here is an alternative that uses strsplit, with the result coerced to a matrix.
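A minimal sketch of what a strsplit-then-matrix approach could look like in base R. It is an illustration, not necessarily the answerer's code; the names `dat`, `pieces`, `k`, and `m` are assumptions, and the split pattern is the same lookbehind used by `fixed_split` in answer 0.

```r
# Example data mirroring the question (the name `dat` is illustrative)
dat <- data.frame(
  var1 = 1:4,
  var2 = c("abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"),
  stringsAsFactors = FALSE
)

n <- 4

# Split each string after every n characters
pieces <- strsplit(dat$var2, paste0("(?<=.{", n, "})"), perl = TRUE)

# Pad every vector of pieces with NA up to the longest one,
# then stack the padded vectors row-wise into a character matrix
k <- max(lengths(pieces))
m <- do.call(rbind, lapply(pieces, function(p) p[seq_len(k)]))
colnames(m) <- paste0("new_var", seq_len(k))

# Attach the matrix columns to the original data frame
res <- cbind(dat, m, stringsAsFactors = FALSE)
res
```

Padding with NA before calling rbind avoids the recycling that rbind would otherwise apply to vectors of unequal length.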

Answer 3 (score: 0)

Use successive substr calls on the same variable:

  library(data.table)
  dff <- fread("var1  var2
1     abcdefghi 
2     abcdefghijklmnop
3     abc 
4     abcdefghijklmnopqrst")

  var2 <- dff[["var2"]]
  for (j in 1:5) {
    set(dff, j = paste0("new_var", j), value = substr(var2, 4*j - 3, 4*j))
  }
  dff
#>    var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
#> 1:    1            abcdefghi     abcd     efgh        i                  
#> 2:    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop         
#> 3:    3                  abc      abc                                    
#> 4:    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst

Created on 2018-08-05 by the reprex package (v0.2.0).
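The loop in this answer hard-codes five columns and a width of 4. Below is a sketch of the same substr idea in base R with the column count derived from the longest string; the names `dat2`, `width`, and `n_cols` are illustrative and not part of the original answer.

```r
# Example data mirroring the question (the name `dat2` is illustrative)
dat2 <- data.frame(
  var1 = 1:4,
  var2 = c("abcdefghi", "abcdefghijklmnop", "abc", "abcdefghijklmnopqrst"),
  stringsAsFactors = FALSE
)

width <- 4
# Number of columns needed to hold the longest string
n_cols <- ceiling(max(nchar(dat2$var2)) / width)

for (j in seq_len(n_cols)) {
  # Characters (j - 1) * width + 1 through j * width;
  # substr() returns "" when the range lies past the end of the string
  dat2[[paste0("new_var", j)]] <- substr(dat2$var2, (j - 1) * width + 1, j * width)
}
dat2
```

As in the answer's output, strings that run out early get empty strings rather than NA in the remaining columns.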

Answer 4 (score: 0)

You can use tidyr::separate:

library(tidyr)
n <- ((max(nchar(df$var2)) - 1) %/% 4) + 1
df %>% separate(var2, into=paste0("new_var", seq(n)), sep=seq(n-1)*4, remove = FALSE)
#   var1                 var2 new_var1 new_var2 new_var3 new_var4 new_var5
# 1    1            abcdefghi     abcd     efgh        i                  
# 2    2     abcdefghijklmnop     abcd     efgh     ijkl     mnop         
# 3    3                  abc      abc                                    
# 4    4 abcdefghijklmnopqrst     abcd     efgh     ijkl     mnop     qrst

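We first compute the number of groups using integer division, then build the new column names dynamically and split at the relevant positions by passing numeric values to the sep argument.

Data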
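To make the arithmetic concrete, here is a small base-R sketch that works through the group count and split positions for the example strings. The names `lens`, `width`, `n_groups`, `cut_points`, and `new_names` are illustrative, and `seq_len` is my substitution for `seq`, chosen because `seq_len(0)` is empty whereas `seq(0)` yields `c(1, 0)`.

```r
lens  <- c(9, 16, 3, 20)   # nchar() of the four example strings
width <- 4

# Integer division: (20 - 1) %/% 4 is 4, plus 1 gives 5 groups
n_groups <- ((max(lens) - 1) %/% width) + 1

# Split after characters 4, 8, 12 and 16 (one fewer cut than groups)
cut_points <- seq_len(n_groups - 1) * width

# One column name per group
new_names <- paste0("new_var", seq_len(n_groups))

n_groups
cut_points
new_names
```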

df <- read.table(text="var1  var2
1     abcdefghi 
2     abcdefghijklmnop
3     abc 
4     abcdefghijklmnopqrst", stringsAsFactors = FALSE, header = TRUE)