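I have a large dataset, a small sample of which looks like the 4 x 5 tibble below. I tried using cSplit from the splitstackshape package:
library(tibble)           # for tibble()
library(splitstackshape)  # for cSplit()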
dt <- tibble(
a = c("Quartz | White Spirit | Wildfire", "Quiet Riot", "Race Against Time", "Down | Heart Lane | X | Breaking H"),
b = c("Muthas Pride", "Killer Girls / Slick Black Cadillac", "Demo 1980", "Life 55"),
c = c("Split", "Single", "Demo", "Split"),
d = c("Birmingham, England | Hartlepool, England | Sheffield, South Yorkshire, England", "Los Angeles, California", "Nottingham, England", "Liverpool | Beijing | | NYC"),
e = c("wf | ef | ff", "g", "f", "cf | af | df | rf")
)
dt.s <- subset(dt, c == "Split")
dt.split <- cSplit(dt.s, c("a", "d", "e"), c("|", "|", "|"), "long")
dt.split
This splits the multiple delimited columns into unique rows, as shown below:
   a            b            c     d                                   e
1: Quartz       Muthas Pride Split Birmingham, England                 wf
2: White Spirit Muthas Pride Split Hartlepool, England                 ef
3: Wildfire     Muthas Pride Split Sheffield, South Yorkshire, England ff
4: NA           Muthas Pride Split NA                                  NA
5: Down         Life 55      Split Liverpool                           cf
6: Heart Lane   Life 55      Split Beijing                             af
7: X            Life 55      Split                                     df
8: Breaking H   Life 55      Split NYC                                 rf
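However, cSplit forces an extra row of NAs into the result, as seen in row 4.
This is not a problem if I split only two columns. How do I stop it from generating the NA row? Also, is there a way to make this work with grouping by c?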
Answer 0: (score: 0)
Since we are working with a tibble, we can use separate_rows from tidyr, which does not produce the NA row:
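library(tidyr)
library(dplyr)  # for %>% and select_at()
# Split a, d and e on "|", trimming the surrounding whitespace
separate_rows(dt.s, c("a", "d", "e"), sep = "\\s*\\|\\s*") %>%
  select_at(names(dt.s))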
# A tibble: 7 x 5
#  a            b            c     d                                   e
#  <chr>        <chr>        <chr> <chr>                               <chr>
#1 Quartz       Muthas Pride Split Birmingham, England                 wf
#2 White Spirit Muthas Pride Split Hartlepool, England                 ef
#3 Wildfire     Muthas Pride Split Sheffield, South Yorkshire, England ff
#4 Down         Life 55      Split Liverpool                           cf
#5 Heart Lane   Life 55      Split Beijing                             af
#6 X            Life 55      Split                                     df
#7 Breaking H   Life 55      Split NYC                                 rf
As for why cSplit gives the extra NA row, it helps to inspect the output in 'wide' format:
cSplit(dt.s, c("a", "d", "e"), "|")
#   b            c     a_1    a_2          a_3      a_4        d_1                 d_2                 d_3                                 d_4 e_1 e_2 e_3 e_4
#1: Muthas Pride Split Quartz White Spirit Wildfire NA         Birmingham, England Hartlepool, England Sheffield, South Yorkshire, England NA  wf  ef  ff  NA
#2: Life 55      Split Down   Heart Lane   X        Breaking H Liverpool           Beijing                                                 NYC cf  af  df  rf
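Here we can see that the second row splits on | into 4 values, so an NA is created for the first row, whose column 'a' holds only 3 values. When we then use the 'long' format, this NA is carried over as a row of its own. This may be a bug.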
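To make the mismatch concrete, the number of pieces in each cell can be counted directly. This is a minimal base R sketch and not part of the original answer; the helper name count_pieces is hypothetical, and it only illustrates the padding described above rather than what cSplit does internally.

```r
# Column 'a' of the two "Split" rows from the question.
a <- c("Quartz | White Spirit | Wildfire",
       "Down | Heart Lane | X | Breaking H")

# Hypothetical helper: how many pieces each string yields when split on "|".
count_pieces <- function(x, sep = "|") {
  lengths(strsplit(x, sep, fixed = TRUE))
}

n <- count_pieces(a)
n
# [1] 3 4

# Padding every row to the longest row leaves this many empty slots per row.
max(n) - n
# [1] 1 0
```

The single empty slot in the first row corresponds to the NA in a_4 of the wide output.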
Answer 1: (score: 0)
Try adding makeEqual = FALSE to the cSplit call:
cSplit(dt.s, c("a", "d", "e"), "|", "long", makeEqual = FALSE)
##    a            b            c     d                                   e
## 1: Quartz       Muthas Pride Split Birmingham, England                 wf
## 2: White Spirit Muthas Pride Split Hartlepool, England                 ef
## 3: Wildfire     Muthas Pride Split Sheffield, South Yorkshire, England ff
## 4: Down         Life 55      Split Liverpool                           cf
## 5: Heart Lane   Life 55      Split Beijing                             af
## 6: X            Life 55      Split                                     df
## 7: Breaking H   Life 55      Split NYC                                 rf
Also, since you are already using packages from the "tidyverse", you can handle the subsetting on c in the same pipeline, like this:
library(dplyr)  # for %>% and filter()
dt %>%
  filter(c == "Split") %>%
  cSplit(c("a", "d", "e"), "|", "long", makeEqual = FALSE)