# Sample Data Frame
df <- data.frame(Column_A = c("1011 Red Cat",
                              "Mouse 2011 is in the House 3001",
                              "Yellow on Blue Dog walked around Park"))
I have a column of manually entered data that I need to clean up.
Column_A
1|1011 Red Cat |
2|Mouse 2011 is in the House 3001 |
3|Yellow on Blue Dog walked around Park|
I want to separate each feature into its own column, while still keeping Column_A so that I can extract other features from it later.
Colour Code Column_A
1|Red |1011 |Cat
2|NA |2011 3001 |Mouse is in the House
3|Yellow on Blue |NA |Dog walked around Park
So far I have been rearranging the values with gsub and capture groups, and then separating them with tidyr::extract.
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df %>%
  # Reorders the Colours
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  # Removes Whitespaces
  mutate(Column_A = str_squish(Column_A)) %>%
  # Extracts the Colours
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%
  # Repeats the Preceding Steps for Codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A = str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))
The result is as follows:
Colour Code Column_A
|Red |1011 |Cat
|Yellow |NA |on Blue Dog walked around Park
|NA |2011 |Mouse is in the House 3001
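This works fine for the first row, but not for rows where the values are separated by spaces and other words; for those I have been extracting and merging the pieces afterwards. Is there a more elegant way to do this?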
Answer 0 (score: 3)
Here is a solution that combines stringr and gsub with the list of colours that R provides:
library(dplyr)
library(stringr)
# list of colours from R colors()
cols <- as.character(colors())
apply(df,
      1,
      function(x)
        tibble(
          # Extract CSV of colours
          Color = cols[cols %in% str_split(tolower(x), " ", simplify = TRUE)] %>%
            paste0(collapse = ","),
          # Extract CSV of sequential lists of digits
          Code = str_extract_all(x, regex("\\d+"), simplify = TRUE) %>%
            paste0(collapse = ","),
          # Remove colours and digits from Column_A
          Column_A = gsub(paste0("(\\d+|",
                                 paste0(cols, collapse = "|"),
                                 ")"), "", x, ignore.case = TRUE) %>% trimws())) %>%
  bind_rows()
# A tibble: 3 x 3
Color Code Column_A
<chr> <chr> <chr>
1 red 1011 Cat
2 "" 2011,3001 Mouse is in the House
3 blue,yellow "" on Dog walked around Park
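The colour lookup in this answer can be tried on a single string to see why the third row comes out as "blue,yellow" rather than "Yellow on Blue". This is only a small sketch: the variable names x, words and found are made up for illustration, and it assumes the default grDevices package is attached so that colors() is available.

```r
library(stringr)

# Built-in colour names, all lower case (from grDevices)
cols <- colors()

# One sample string from the question
x <- "Yellow on Blue Dog walked around Park"

# Split the lower-cased string into words (a one-row character matrix)
words <- str_split(tolower(x), " ", simplify = TRUE)

# Keep the colour names that occur as whole words in the string;
# the result follows the order of colors(), not the order in the text
found <- cols[cols %in% words]
color_csv <- paste0(found, collapse = ",")
print(color_csv)
```

Because the match is made against lower-cased whole words, the original capitalisation and the connecting word "on" are not kept in the Color column.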
Answer 1 (score: 2)
Using the tidyverse, we can do:
library(tidyverse)
colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
df %>%
  mutate(Color = str_extract(Column_A,
                 paste0("(", colors, ").*(", colors, ")|(", colors, ")")),
         # All runs of digits per row, kept as a list column for now
         Code = str_extract_all(Column_A, "\\d+"),
         # Remove the matched colour text and codes from Column_A
         Column_A = pmap_chr(list(Color, Code, Column_A), function(x, y, z)
           trimws(gsub(paste0("\\b", c(x, y), "\\b", collapse = "|"), "", z))),
         # Collapse each list element into a single string
         Code = map_chr(Code, paste, collapse = " "))
# Column_A Color Code
#1 Cat Red 1011
#2 Mouse is in the House <NA> 2011 3001
#3 Dog walked around Park Yellow on Blue
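We first use str_extract to extract the text between two of the colors. You can include in colors every color that may occur in your data. We use paste0 to build the regular expression. For this example,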
paste0("(", colors, ").*(", colors, ")|(", colors, ")")
#[1] "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"
which means: extract the text between two colors (including the colors themselves), or extract just a single color.
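To see how the two alternatives of that pattern behave, it can be run against the three sample strings outside of mutate. This is a minimal sketch; the vector x and the names pattern and color_match are only illustrative, and the three-colour vocabulary is the same assumption the answer makes.

```r
library(stringr)

# Same sample strings as in the question
x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# Assumed colour vocabulary; extend it to cover your own data
colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
pattern <- paste0("(", colors, ").*(", colors, ")|(", colors, ")")

# First alternative: from the first colour through the last colour;
# second alternative: a single colour on its own
color_match <- str_extract(x, pattern)
print(color_match)
```

The first string only satisfies the single-colour alternative, the second has no colour and gives NA, and the third is matched by the two-colour alternative, so the words between the colours come along with them.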
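For the Code part, since a row can contain more than one Code value, we use str_extract_all to get all the numbers from the column. At this stage the result is stored in a list.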
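The list structure is easy to inspect on its own. The sketch below reuses the sample strings; x and codes are illustrative names, not part of the answer.

```r
library(stringr)

# Same sample strings as in the question
x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# One list element per input string, each a character vector of digit runs
codes <- str_extract_all(x, "\\d+")
str(codes)

# Number of codes found in each row; rows without digits get a
# zero-length character vector rather than NA
print(lengths(codes))
```

This is why the Code column cannot be used as an ordinary character column until its elements have been collapsed.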
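For the Column_A values, we use gsub to remove everything that was selected in Color and Code, adding word boundaries (\\b) around each piece, and keep what remains.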
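The removal step can be sketched in base R for a single row. The helper remove_matched below is hypothetical (it is not part of the answer); it only wraps the same paste0 and gsub construction so that it can be called directly.

```r
# Hypothetical helper mirroring the gsub() step from the answer
remove_matched <- function(text, pieces) {
  # Wrap every piece in word boundaries and join them with "|"
  pattern <- paste0("\\b", pieces, "\\b", collapse = "|")
  # Delete every match, then trim leading and trailing whitespace
  trimws(gsub(pattern, "", text))
}

row1 <- remove_matched("1011 Red Cat", c("Red", "1011"))
row3 <- remove_matched("Yellow on Blue Dog walked around Park",
                       "Yellow on Blue")
print(row1)
print(row3)
```

Note that trimws only strips the ends of the string, so when something is removed from the middle of a sentence the gap it leaves is not collapsed; str_squish could be applied afterwards if that matters.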
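Because Code was stored as a list earlier, we convert it to a single string per row by collapsing each element. This returns an empty string for rows with no match. If needed, you can turn those back into NA by adding Code = replace(Code, Code == "", NA) to the chain.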
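The collapse and the optional conversion back to NA can be checked on a small list. This sketch uses base R's vapply in place of map_chr so that it runs without purrr; the names codes and collapsed are illustrative.

```r
# A list shaped like the output of str_extract_all() for the sample data
codes <- list("1011", c("2011", "3001"), character(0))

# Collapse each element into one string (vapply stands in for map_chr here);
# a zero-length element collapses to ""
collapsed <- vapply(codes, paste, character(1), collapse = " ")
print(collapsed)

# Turn the empty strings back into NA
code <- replace(collapsed, collapsed == "", NA)
print(code)
```

Inside a dplyr chain, dplyr::na_if(Code, "") should give the same result as the replace call.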