This is how I want the data frame to look:
record color size  height weight
1      blue  large        heavy
1      red
2      green small tall   thin
However, the data (df) looks like this:
record vars
1      color = "blue", size = "large"
2      color = "green", size = "small"
2      height = "tall", weight = "thin"
1      color = "red", weight = "heavy"
For each record, I want to split the vars column on the "," delimiter and create new columns using the indicated variable names. If a particular variable has more than one value, the record should be repeated.
I know that to do this with the tidyverse I will need dplyr::group_by and tidyr::separate, but I am not clear on how to get the new variable names into the "into" argument of separate. Do I need some kind of regular expression that identifies any text before the equals sign "=" as the new variable name for "into"? Any suggestions are most welcome!
Code for df:
df <- structure(list(record = c(1L, 2L, 2L, 1L),
                     vars = structure(c(1L, 2L, 4L, 3L),
                                      .Label = c("color = \"blue\", size = \"large\"",
                                                 "color = \"green\", size = \"small\"",
                                                 "color = \"red\", weight = \"heavy\"",
                                                 "height = \"tall\", weight = \"thin\""),
                                      class = "factor")),
                class = "data.frame", row.names = c(NA, -4L))
Answer 0 (score: 6)
Since the values in that column are almost R code defining a list already, you can parse and evaluate them and then use unnest_wider:
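library(tidyverse)
df %>%
  # wrap each string in list(...), parse it as R code and evaluate it
  mutate(vars = map(as.character(vars), ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>%
  # spread the resulting named lists into one column per name
  unnest_wider(vars)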
#   record color size  height weight
#    <int> <chr> <chr> <chr>  <chr>
# 1      1 blue  large NA     NA
# 2      2 green small NA     NA
# 3      2 NA    NA    tall   thin
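To see why this works, it may help to run the parse/evaluate step on a single value. The following is a minimal base R sketch, not part of the original answer. It assumes a string shaped like the entries of vars and uses base parse(text = ) in place of parse_expr(); the names one_value, code and parsed are made up for illustration.

```r
# One entry of `vars`, written out as a plain string (illustrative value)
one_value <- 'color = "blue", size = "large"'

# Wrapping it in list( ... ) yields valid R code for a named list
code <- paste0("list(", one_value, ")")

# parse() turns the text into an expression; eval() runs it
parsed <- eval(parse(text = code))

# Inspect the resulting named list
str(parsed)
```

Because this evaluates the strings as code, it is probably only advisable when the contents of the column are trusted.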
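Answer 1 (score: 2)
Here is an option with the tidyverse. Create a sequence column "rn", split the "vars" column into rows with separate_rows, remove the quotes with str_remove_all, split the column into two with separate, and reshape to wide format with pivot_wider: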
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(rn = row_number()) %>%
  separate_rows(vars, sep = ",\\s*\\n*") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, into = c("vars1", "vars2"), sep = "\\s*=\\s*") %>%
  pivot_wider(names_from = vars1, values_from = vars2,
              values_fill = list(vars2 = '')) %>%
  select(-rn)
# A tibble: 3 x 5
#   record color size  height weight
#    <int> <chr> <chr> <chr>  <chr>
# 1      1 blue  large ""     ""
# 2      2 green small ""     ""
# 3      2 ""    ""    tall   thin
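The two regular expressions in the pipeline above do the actual work of finding the variable names, so a small sketch of them on a single string may be useful. This is a base R illustration only (strsplit() and gsub() stand in for separate_rows(), str_remove_all() and separate()); the names one_value, pairs, parts, keys, vals and result are hypothetical.

```r
# One entry of `vars`, written out as a plain string (illustrative value)
one_value <- 'color = "blue", size = "large"'

# Split on commas (plus any following whitespace) to get one key-value pair per element
pairs <- strsplit(one_value, ",\\s*")[[1]]

# Drop the double quotes around the values
pairs <- gsub('"', "", pairs, fixed = TRUE)

# Split each pair on "=" with optional surrounding whitespace
parts <- strsplit(pairs, "\\s*=\\s*")

# The text before "=" becomes the name, the text after it the value
keys <- vapply(parts, `[`, character(1), 1)
vals <- vapply(parts, `[`, character(1), 2)
result <- setNames(vals, keys)
result
```

So no separate regular expression is needed for the "into" argument: the names end up in their own column and pivot_wider turns them into column headers.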
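Answer 2 (score: 0)
Another approach is to convert each value to a two-row matrix and merge the results. We need a helper function FUN that converts a vector into a matrix and uses its first row as the header.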
FUN <- function(x) {
  m <- matrix(x, 2)  # row 1: variable names, row 2: values
  as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))
}
Then strip the non-word characters and merge.
dat <- df  # this answer refers to the data as dat
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(dat$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", dat$record)  # cbind record column
Reduce(function(...) merge(..., all=TRUE), l)  # merge
#   record color weight  size height
# 1      1  blue   <NA> large   <NA>
# 2      1   red  heavy  <NA>   <NA>
# 3      2 green   thin small   tall
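The intermediate steps of this approach can be sketched on a single value. This is only an illustration under the assumption that keys and values consist of word characters; the names one_value, cleaned, tokens and m are made up for the example.

```r
# One entry of `vars`, written out as a plain string (illustrative value)
one_value <- 'color = "blue", size = "large"'

# Collapse every run of non-word characters (spaces, "=", quotes, commas) to one space
cleaned <- trimws(gsub("\\W+", " ", one_value))

# Split into alternating keys and values
tokens <- strsplit(cleaned, " ")[[1]]

# Fill a two-row matrix column by column: row 1 holds the keys, row 2 the values
m <- matrix(tokens, 2)
m
```

Note that a value containing a space or punctuation would be split into several tokens, so this approach relies on simple one-word values.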
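Answer 3 (score: 0)
I have just noticed that none of the answers posted so far (including the accepted answer) fully reproduces the OP's expected result:
record color size  height weight
1      blue  large        heavy
1      red
2      green small tall   thin
It shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value pairs for record 2 can be arranged in one row because no variable has duplicate values. For record 1, the variable color has two values, which appear in rows 1 and 2 as requested by the OP:
"the record should be repeated if there are multiple values for a particular variable"
All other variables of record 1 have only one value (or none) and are placed in row 1.
So, for each record, a sub-table with a ragged bottom is created in which the columns are filled from top to bottom (each column separately).
I have tried to reproduce this in two different ways: first with data.table (which I am more fluent in), then with dplyr / tidyr. Finally, I will suggest using toString() to represent the duplicate values.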
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
  , tstrsplit(V1, " = "), by = .(rn, record)][
  , dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
  , record_1 := NULL][]
   record color  size height weight
1:      1  blue large   <NA>  heavy
2:      1   red  <NA>   <NA>   <NA>
3:      2 green small   tall   thin
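This works in 5 steps:
1. Split the key-value pairs in each row of vars and arrange them in separate rows.
2. Remove the double quotes.
3. Split each key-value pair into separate columns V1 (key) and V2 (value).
4. Reshape from long to wide format, where the rows are given by record and a count of each key within record using rowid(), and the columns are given by the keys (variables). Using fct_inorder() ensures that the columns are arranged in the order in which the variables appear (just to reproduce the OP's expected result exactly).
5. Remove the helper column record_1.
For even closer agreement with the OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.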
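The count of each key within a record is what decides which output row a value lands in, so here is a small sketch of that count on its own. It deliberately uses base R ave() as a stand-in for data.table's rowid() to keep the example free of dependencies; the vectors record and key are hypothetical long-format data, and occurrence is an illustrative name.

```r
# Hypothetical long-format data: record 1 has two values for "color"
record <- c(1, 1, 1, 2)
key    <- c("color", "size", "color", "color")

# Number the occurrences of each (record, key) combination, in order of appearance
occurrence <- ave(seq_along(key), record, key, FUN = seq_along)
occurrence
```

The second "color" of record 1 gets the count 2 and is therefore placed in a second row for that record, while every first occurrence shares row 1.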
dplyr / tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate_rows(vars, sep = ", ") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, c("key", "val")) %>%
  group_by(record, key) %>%
  mutate(keyid = row_number(key)) %>%
  pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
  arrange(record, keyid) %>%
  select(-keyid)
# A tibble: 3 x 5
# Groups:   record [2]
  record color size  height weight
   <int> <chr> <chr> <chr>  <chr>
1      1 blue  large NA     heavy
2      1 red   NA    NA     NA
3      2 green small tall   thin
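The steps are essentially the same as in the data.table approach. The statements
  group_by(record, key) %>%
  mutate(keyid = row_number(key))
are a replacement for data.table::rowid().
Adding the parameter values_fill = list(val = "") to the pivot_wider() call replaces the NAs with blanks.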
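The following is not intended to reproduce the OP's expected result as closely as possible, but to suggest a more concise representation of the result, with one row per record.
During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.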
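As a quick illustration of what the aggregation function does to the values collected for one cell, here is a minimal base R sketch; the input vectors are made up and the result names are only for illustration.

```r
# Two values for the same key are joined with ", "
two_values <- toString(c("blue", "red"))

# A single value is returned unchanged
one_value <- toString("large")

# No values at all give an empty string
no_values <- toString(character(0))

c(two_values, one_value, no_values)
```

The empty string for a key without values appears to be why the data.table result below shows a blank rather than NA for the height of record 1.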
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
  , tstrsplit(V1, " = "), by = .(rn, record)][
  , dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
   record     color  size height weight
1:      1 blue, red large         heavy
2:      2     green small   tall   thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate_rows(vars, sep = ", ") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, c("key", "val")) %>%
  pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
  record color     size  height weight
   <int> <chr>     <chr> <chr>  <chr>
1      1 blue, red large NA     heavy
2      2 green     small tall   thin