Question

Consider this minimal working example of a very messy dataset I am working with:

library(dplyr)
library(tidyr)

x<- paste(sort(rep(LETTERS[1:4], 3)), paste0(rep("#", 3), rep(11:13, 3)))
y<- paste(sort(rep(LETTERS[1:4], 2)), paste0(rep(1:2, 2), rep("/0", 2)))
data<- data.frame(Item = c(x, y))

which gives:

    Item
1  A #11
2  A #12
3  A #13
4  B #11
5  B #12
6  B #13
7  C #11
8  C #12
9  C #13
10 D #11
11 D #12
12 D #13
13 A 1/0
14 A 2/0
15 B 1/0
16 B 2/0
17 C 1/0
18 C 2/0
19 D 1/0
20 D 2/0

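I want to split Item into Item and Size. There are two types of size. The first, 11:13, is identified by #. The second, 1/0:2/0, can be identified by /0 in this example. To separate the first size type from Item, I can use data %>% separate(Item, into = c("Item", "Size"), sep = "#"). However, this outputs NA in rows 13:20.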

How can I separate the variable conditionally, so that Item and Size are also split for the second size type?

I tried the code below, but it did not work.

data %>% 
        separate(Item, into = c("Item", "Size"), sep = "#") %>% 
        mutate(ifelse(grepl("/0", Item) == TRUE, separate(Item, into = c("Item", "Size"), sep = " (?=[^ ]+$)", perl=TRUE), Size))

Edit

The desired output should look like this:

   Item Size
1     A   11
2     A   12
3     A   13
4     B   11
5     B   12
6     B   13
7     C   11
8     C   12
9     C   13
10    D   11
11    D   12
12    D   13
13    A  1/0
14    A  2/0
15    B  1/0
16    B  2/0
17    C  1/0
18    C  2/0
19    D  1/0
20    D  2/0

Answer 1

To answer your question: the regex | operator lets you specify several alternative separators.

data %>% 
  separate(Item, into = c("Item", "Size"), sep = " #| ")

Or you could split everything on the " " character that every row has in common and then clean up the Size column afterwards:

data %>% 
      separate(Item, into = c("Item", "Size"), sep = " ")

See https://stringr.tidyverse.org/articles/regular-expressions.html for more regex info to help with your cleaning. If you work with untidy text, you're going to love (and need) stringr.
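The clean-up step mentioned above is not shown, so here is a minimal sketch of one way it might look. It assumes the only leftover debris is a leading # in Size and uses stringr::str_remove for it; the four-row data frame is a small stand-in for the question's data, not the full example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Small stand-in for the question's data (assumption: same "Item Size" layout)
data <- data.frame(
  Item = c("A #11", "B #12", "A 1/0", "B 2/0"),
  stringsAsFactors = FALSE
)

cleaned <- data %>%
  # Split on the single space that every row contains
  separate(Item, into = c("Item", "Size"), sep = " ") %>%
  # Drop a leading "#" where present; sizes such as "1/0" are left untouched
  mutate(Size = str_remove(Size, "^#"))

print(cleaned)
```

Anchoring the pattern with ^ keeps the clean-up from touching a # that might appear elsewhere in a size string.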

Answer 2

I think this may be what you are looking for. Split on the space and then replace either # or /0 with blank, unless I misunderstood.

data %>%
  separate(Item, into = c("Item", "Size"), sep = " ") %>%
  mutate(Size = gsub("/0|#", "", Size))

Answer 3

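Since Size is either # followed by digits, or digits following a space, the pattern goes into the sep argument.

" #(?=[0-9])" matches patterns such as " #1"
" [0-9]" matches patterns such as " 1"
| means or

All together, the alternatives are combined with | (assuming these patterns do not occur in the item names).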
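The combined call is missing from this answer, so the following is only a sketch of how the two alternatives might be joined in sep. One assumption is mine rather than the answer's: the second alternative is written with a lookahead, " (?=[0-9])", because a plain " [0-9]" used as a separator would also consume the first digit of the size. The data frame is a small stand-in for the question's data.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the question's data (assumption: same "Item Size" layout)
data <- data.frame(
  Item = c("A #11", "B #12", "A 1/0", "B 2/0"),
  stringsAsFactors = FALSE
)

result <- data %>%
  separate(
    Item,
    into = c("Item", "Size"),
    # " #" before a digit, OR a space before a digit; the lookaheads
    # keep the digit itself out of the separator
    sep = " #(?=[0-9])| (?=[0-9])"
  )

print(result)
```

Because both alternatives require a digit right after the separator, an item name containing a space followed by a letter would not be split, though a name containing a space followed by a digit still would be.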
