I have an R data frame, data1, shown below:
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1
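This is of course a highly simplified version of my real data, which has about 3 million rows. I need to do the following:
First, based on the maximum value in the Term column, insert that many columns into data1, filled with NA. The columns should be named Week1, Week2, Week3, and so on.
Then fill the new columns: 1) if Term is 5, insert 0 in Week1, Week2, up to Week4, and 1 in Week5; 2) if Term is 4, insert 0 in Week1, Week2 and Week3, 1 in Week4, and NA in Week5. And so on.
The final output should look like this:
prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
1 1001 5 0 0 0 0 0 1
1 1002 4 1 0 0 0 1 NA
1 1003 3 1 0 0 1 NA NA
1 1004 5 0 0 0 0 0 1
2 1001 4 1 0 0 0 1 NA
2 1002 3 1 0 0 1 NA NA
2 1003 5 0 0 0 0 0 1
3 1001 4 1 0 0 0 1 NA
3 1002 3 1 0 0 1 NA NA
3 1003 5 0 0 0 0 0 1
4 1001 4 1 0 0 0 1 NA
4 1002 3 1 0 0 1 NA NA
5 1001 5 0 0 0 0 0 1
5 1002 4 1 0 0 0 1 NA
5 1003 3 1 0 0 1 NA NA
What I have tried does not fill the required cells with NA. I want to keep the NA values because I will later do a wide-to-long transformation on the data frame. I also know that my approach is not feasible on my huge data set. Any suggestions are most welcome.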
Answer 0 (score: 3)
Here is one idea. We can build the required values as a comma-separated string, then split that string into columns.
library(dplyr)
library(data.table)
library(splitstackshape)
dat2 <- dat %>%
  mutate(Week = case_when(
    Term == 5 ~ "0,0,0,0,1",
    Term == 4 ~ "0,0,0,1,NA",
    Term == 3 ~ "0,0,1,NA,NA",
    Term == 2 ~ "0,1,NA,NA,NA",
    Term == 1 ~ "1,NA,NA,NA,NA"
  )) %>%
  cSplit(splitCols = "Week")
dat2
# prodID storeID Term Exit Week_1 Week_2 Week_3 Week_4 Week_5
# 1: 1 1001 5 0 0 0 0 0 1
# 2: 1 1002 4 1 0 0 0 1 NA
# 3: 1 1003 3 1 0 0 1 NA NA
# 4: 1 1004 5 0 0 0 0 0 1
# 5: 2 1001 4 1 0 0 0 1 NA
# 6: 2 1002 3 1 0 0 1 NA NA
# 7: 2 1003 5 0 0 0 0 0 1
# 8: 3 1001 4 1 0 0 0 1 NA
# 9: 3 1002 3 1 0 0 1 NA NA
# 10: 3 1003 5 0 0 0 0 0 1
# 11: 4 1001 4 1 0 0 0 1 NA
# 12: 4 1002 3 1 0 0 1 NA NA
# 13: 5 1001 5 0 0 0 0 0 1
# 14: 5 1002 4 1 0 0 0 1 NA
# 15: 5 1003 3 1 0 0 1 NA NA
Or use this tidyverse approach. I prefer it to the previous one because it does not require typing the week values by hand.
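library(dplyr)
library(tidyr)
library(purrr)
dat2 <- dat %>%
  # List-column holding the sequence 1:Term for each row
  mutate(Week = map2(1, Term, `:`)) %>%
  # Name the column explicitly; a bare unnest() warns in tidyr >= 1.0.0
  unnest(Week) %>%
  group_by(prodID, Term) %>%
  # 1 in the last week of each sequence, 0 elsewhere
  mutate(Week_Value = as.integer(Week == max(Week)),
         Week = paste0("Week", Week)) %>%
  spread(Week, Week_Value) %>%
  ungroup()
dat2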
# # A tibble: 15 x 9
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
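Update
We can use str_pad from the stringr package to pad the week number with zeros before spreading the Week column, so that the column names sort in the right order.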
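library(tidyverse)
dat2 <- dat %>%
  mutate(Week = map2(1, Term, `:`)) %>%
  # Name the column explicitly; a bare unnest() warns in tidyr >= 1.0.0
  unnest(Week) %>%
  group_by(prodID, Term) %>%
  # Zero-pad the week number so that Week010 sorts after Week002
  mutate(Week_Value = as.integer(Week == max(Week)),
         Week = paste0("Week", str_pad(Week, width = 3, pad = "0"))) %>%
  spread(Week, Week_Value) %>%
  ungroup()
dat2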
# # A tibble: 15 x 9
# prodID storeID Term Exit Week001 Week002 Week003 Week004 Week005
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
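To see why the padding matters, here is a small base R sketch. It is only an illustration: the week numbers are made up, and sprintf stands in for str_pad so that the snippet needs no packages. Column names are sorted as text, so unpadded names go out of order as soon as a week number reaches 10.

```r
# Hypothetical week numbers, including values above 9
weeks <- c(1L, 2L, 10L, 11L)

# Unpadded names versus names padded to three digits
plain  <- paste0("Week", weeks)
padded <- paste0("Week", sprintf("%03d", weeks))

# method = "radix" sorts in the C locale, so the result does not depend
# on the local collation settings
plain_sorted  <- sort(plain, method = "radix")
padded_sorted <- sort(padded, method = "radix")

print(plain_sorted)   # "Week1"   "Week10"  "Week11"  "Week2"
print(padded_sorted)  # "Week001" "Week002" "Week010" "Week011"
```

With five weeks, as in the sample data, both forms sort the same way; the padding only starts to matter for longer terms.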
Data
dat <- read.table(text = "prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE)
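Answer 1 (score: 2)
Here is an option in base R: we loop through Term, use tabulate to get the 0s and 1s for each element, append NA at the end with length<-, and rbind the list elements to create the columns of interest.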
dat[paste0("Week", 1:5)] <- do.call(rbind, lapply(dat$Term,
function(x) `length<-`(tabulate(x), max(dat$Term))))
dat
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1 1 1001 5 0 0 0 0 0 1
#2 1 1002 4 1 0 0 0 1 NA
#3 1 1003 3 1 0 0 1 NA NA
#4 1 1004 5 0 0 0 0 0 1
#5 2 1001 4 1 0 0 0 1 NA
#6 2 1002 3 1 0 0 1 NA NA
#7 2 1003 5 0 0 0 0 0 1
#8 3 1001 4 1 0 0 0 1 NA
#9 3 1002 3 1 0 0 1 NA NA
#10 3 1003 5 0 0 0 0 0 1
#11 4 1001 4 1 0 0 0 1 NA
#12 4 1002 3 1 0 0 1 NA NA
#13 5 1001 5 0 0 0 0 0 1
#14 5 1002 4 1 0 0 0 1 NA
#15 5 1003 3 1 0 0 1 NA NA
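The one-liner above packs several steps together. The sketch below takes it apart for a single Term value and then stacks a few rows. The variable names (term, max_term, terms, week_matrix) are chosen here for illustration and are not part of the answer.

```r
# Hypothetical inputs: one Term value and the largest Term in the data
term <- 3L
max_term <- 5L

# tabulate() counts how often each integer from 1 to max(term) occurs,
# so a single value yields zeros followed by one 1
counts <- tabulate(term)

# Assigning a longer length pads the vector with NA at the end
padded <- `length<-`(counts, max_term)
print(padded)  # 0 0 1 NA NA

# The same steps applied to several Term values, stacked row by row
terms <- c(5L, 4L, 3L)
week_matrix <- do.call(rbind, lapply(terms, function(x) {
  `length<-`(tabulate(x), max(terms))
}))
print(week_matrix)
```

Each row of week_matrix is one record's Week1 to Week5 values, which is what gets assigned into the new columns of dat.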
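Answer 2 (score: 2)
An option using dplyr::mutate_at and case_when: find the integer suffix in the column name using quo_name(quo(.)), then check whether that column number is greater than, equal to, or less than the value of Term.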
# First add additional columns based on maximum value of Term
df[,paste("Week", 1:max(df$Term), sep="")] <- NA
library(dplyr)
df %>% mutate_at(vars(starts_with("Week")), funs(case_when(
  # "^\\D+" strips the non-digit prefix, so multi-digit names such as Week10 parse correctly
  as.integer(sub("^\\D+", "", quo_name(quo(.)))) < Term ~ 0L,
  as.integer(sub("^\\D+", "", quo_name(quo(.)))) == Term ~ 1L,
  TRUE ~ NA_integer_
)))
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
Data:
df <- read.table(text="
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE, stringsAsFactors = FALSE)
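The comparison this answer relies on (week number below Term gives 0, equal to Term gives 1, above Term gives NA) can also be sketched in base R with outer, which compares every Term against every week number in one vectorized call. This is an alternative illustration, not the answer's method; df_small, weeks and week_matrix are names made up for the example, and it has not been timed on 3 million rows.

```r
# Hypothetical three-row sample with the same columns as the question
df_small <- data.frame(
  prodID  = c(1L, 1L, 1L),
  storeID = c(1001L, 1002L, 1003L),
  Term    = c(5L, 4L, 3L)
)

# One week index per new column
weeks <- seq_len(max(df_small$Term))

# outer() pairs every Term with every week index:
# week < Term -> 0, week == Term -> 1, week > Term -> NA
week_matrix <- outer(df_small$Term, weeks, function(term, week) {
  ifelse(week < term, 0L, ifelse(week == term, 1L, NA_integer_))
})
colnames(week_matrix) <- paste0("Week", weeks)

# Attach the new columns to the original data
result <- cbind(df_small, week_matrix)
print(result)
```

Because the column names are generated from max(Term), nothing has to be parsed back out of the names afterwards.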