Question

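I am working with data from a survey question that allows multiple answers, so respondents could tick more than one box. As a result, the dataset stores all of a respondent's answers in a single value.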

df <- c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal", "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")

For example, I would like to split RelaxtGemotiveerdVrolijk into Column 1: Relaxt, Column 2: Gemotiveerd and Column 3: Vrolijk.

Answer 1

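It looks like you want to split each string wherever an uppercase letter occurs, which can be done with a regular expression. Many functions can apply a regular expression this way, such as strsplit() and stringr::str_split(), but tidyr has a function designed specifically for adding new columns with this approach: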

df <- data.frame(
    c1 = c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal", 
           "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")
)

tidyr::separate(df, c1, into = c("c2", "c3", "c4"), 
                sep = "(?<=.)(?=[[:upper:]])", fill = "right", remove = FALSE)
#>                         c1       c2          c3      c4
#> 1       VrolijkGemotiveerd  Vrolijk Gemotiveerd    <NA>
#> 2 RelaxtGemotiveerdVrolijk   Relaxt Gemotiveerd Vrolijk
#> 3                 Neutraal Neutraal        <NA>    <NA>
#> 4            TrotsGezegend    Trots    Gezegend    <NA>
#> 5                 Neutraal Neutraal        <NA>    <NA>
#> 6                 Neutraal Neutraal        <NA>    <NA>
#> 7      VermoeidGemotiveerd Vermoeid Gemotiveerd    <NA>

Edit: updated to use the regular expression from @Laterow's answer, because mine was slightly broken.

Answer 2

Answer

Assuming every category starts with an uppercase letter, use strsplit with a Perl-compatible regular expression:

strsplit(df, "(?<=.)(?=[[:upper:]])", perl = TRUE)

Output:

[[1]]
[1] "Vrolijk"     "Gemotiveerd"

[[2]]
[1] "Relaxt"      "Gemotiveerd" "Vrolijk"    

[[3]]
[1] "Neutraal"

[[4]]
[1] "Trots"    "Gezegend"

[[5]]
[1] "Neutraal"

[[6]]
[1] "Neutraal"

[[7]]
[1] "Vermoeid"    "Gemotiveerd"

Rationale

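strsplit splits strings on a pattern, here a regular expression. The core of the pattern is [[:upper:]], which matches an uppercase letter. Wrapping it in a lookahead, (?=[[:upper:]]), matches the position just before the capital without consuming it, so the letter is kept and the split happens before it rather than after. The lookbehind (?<=.) requires at least one character before that position, so the string is not split at its very first letter. Lookarounds are only available with perl = TRUE.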
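To see what each lookaround contributes, here is a small sketch that marks the match positions with gsub() instead of splitting. The example string comes from the question; the "|" marker and the variable names are only illustrative.

```r
x <- "RelaxtGemotiveerdVrolijk"

# Lookbehind + lookahead: matches only between a character and a capital.
with_lookbehind <- gsub("(?<=.)(?=[[:upper:]])", "|", x, perl = TRUE)
print(with_lookbehind)
#> [1] "Relaxt|Gemotiveerd|Vrolijk"

# Lookahead only: the position before the first capital also matches,
# so a marker appears at the very start of the string as well.
lookahead_only <- gsub("(?=[[:upper:]])", "|", x, perl = TRUE)
print(lookahead_only)
```

The extra match at the start is what the lookbehind is there to prevent.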

This code returns a list, which you can use for further processing.
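As one possible next step, the list can be padded to equal length and bound into a data frame using base R only. This is a minimal sketch, not part of the original answer: the names answers, parts, padded, wide and the answer_1, answer_2, ... column names are assumptions chosen for illustration.

```r
answers <- c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal",
             "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")

# Split before every capital letter except the first one.
parts <- strsplit(answers, "(?<=.)(?=[[:upper:]])", perl = TRUE)

# The widest response determines how many columns are needed.
n_cols <- max(lengths(parts))

# Pad shorter responses on the right with NA so all rows have equal length.
padded <- lapply(parts, function(p) c(p, rep(NA_character_, n_cols - length(p))))

# Bind the rows into a character matrix, then convert to a data frame.
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("answer_", seq_len(n_cols))

print(wide)
```

Rows with fewer answers are padded with NA on the right, which mirrors fill = "right" in the tidyr answer.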

Split a column's categorical values into more columns
