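Using cSplit to split strings into multiple rows on a comma followed by an uppercase letter

Asked: 2019-09-06 17:16:03

Tags: r strsplit csplit

I have survey data. Some questions allow multiple answers, and in my data the different answers are separated by commas. I want to add a new row to the data frame for each choice. So I have something like this: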

survey <- data.frame(
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

If the choices were separated only by commas, I would use:

survey <- cSplit(survey, "q1", ",", direction = "long")

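and get the desired result. Because some answers contain commas themselves, I tried using a comma followed by an uppercase letter as the separator: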

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

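But for some reason it does not work. It does not raise any error, but it does not split the strings, and it also drops some rows from the data frame. Then I tried strsplit:

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

That splits the strings correctly, but I can't work out how to turn the result into separate rows of the same column, the way cSplit does. The desired output is: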

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is it possible to get this with either of the two methods? Thanks.

2 Answers:

Answer 0 (score: 2)

An option with separate_rows:
library(dplyr)
library(tidyr)
survey %>% 
   separate_rows(q1, sep=",(?=[A-Z])")
#                       q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

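cSplit has an argument fixed that is TRUE by default, so sep is treated as a literal string. With fixed = FALSE the pattern is treated as a regular expression, but the call still fails here, most likely because cSplit passes it to strsplit without perl = TRUE and the lookahead (?=...) is PCRE-only syntax: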

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)
  

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : 
  invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
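The same failure can be reproduced without cSplit. The sketch below uses base R only, and the vector x is made-up sample data. It calls strsplit on the lookahead pattern once with the default regex engine and once with perl = TRUE, which suggests that the missing perl = TRUE is what breaks the cSplit call.

```r
# Hypothetical sample data, modelled on the question's answers.
x <- c("I like this,I like that", "I like this, but not much,I like that")

# Default regex engine: the lookahead is rejected, so we capture the error
# message instead of letting the call stop the script.
default_engine <- tryCatch(
  strsplit(x, ",(?=[A-Z])"),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the lookahead is accepted, and only a comma that is
# immediately followed by an uppercase letter acts as a separator.
pcre_engine <- strsplit(x, ",(?=[A-Z])", perl = TRUE)

print(default_engine)
print(pcre_engine)
```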

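One way around this is to first modify the column with a function (sub/gsub) that accepts PCRE regular expressions, replacing the separator with a plain character (here ":"), and then call cSplit with that character as sep:

cSplit(transform(survey, q1 = sub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
         "q1", sep=":", direction = "long")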
#                        q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that
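The question also asks whether strsplit alone can produce the long format. A minimal base R sketch, assuming a data frame shaped like the one in the question, is to split with perl = TRUE, repeat each row once per piece, and then overwrite the column with the flattened pieces. The id column is hypothetical and is included only to show that the other columns are carried along.

```r
# Sample data modelled on the question; `id` is a hypothetical extra column.
survey <- data.frame(
  id = 1:6,
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split only on a comma that is directly followed by an uppercase letter.
parts <- strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each row as many times as it has pieces, then fill in the pieces.
survey_long <- survey[rep(seq_len(nrow(survey)), lengths(parts)), , drop = FALSE]
survey_long$q1 <- unlist(parts)
rownames(survey_long) <- NULL

print(survey_long)
```

Rows 5 and 6 of the input each produce two rows, so the result has eight rows and the repeated id values show which original response each piece came from.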

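Answer 1 (score: 1)

@akrun's answer is the correct one. I just want to add that if you need to split some strings into more than two parts, the way to make his code work is simply to run the same line several times. I'm not entirely sure why that is, but it works.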
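A likely explanation is that sub replaces only the first match in each string, so every run of the line converts just one more separator, whereas gsub replaces all matches in a single pass. The sketch below illustrates the difference in base R on a made-up three-part answer. It assumes that ":" never occurs in the real answers.

```r
# Hypothetical answer with three choices.
x <- "I like this,I like that,I like both"

# sub() changes only the first ",<uppercase>" separator.
first_only <- sub(",(?=[A-Z])", ":", x, perl = TRUE)

# gsub() changes every ",<uppercase>" separator in one pass.
every_match <- gsub(",(?=[A-Z])", ":", x, perl = TRUE)

# Splitting on the literal ":" now yields all three parts at once.
pieces <- strsplit(every_match, ":", fixed = TRUE)[[1]]

print(first_only)
print(every_match)
print(pieces)
```

Under that assumption, using gsub in place of sub inside the transform call should make a single cSplit call sufficient.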