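I'm a beginner in R and a big fan of the data.table functionality in R. This is probably my first question about R on a forum, so I apologize for any badly formatted code or text.
I happen to work with datasets that have a large number of categorical variables. I have tried to create a mock dataset to explain the problem: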
structure(list(ID = 1234:1237,
AgeC = c("25-30", "31+", "25-30", "20-24"),
GenderC = c("female", "male", "female", "female"),
doyoubuyappleseveryday = c(NA, 1L, NA, NA),
doyoubuyapplesonceinaweek = c(1L, NA, NA, NA),
doyoubuyapplesonceinamonth = c(NA, NA, NA, NA),
doyoubuypearseveryday = c(NA, NA, NA, NA),
doyoubuypearsonceinaweek = c(NA, NA, NA, 1L),
doyoubuypearssonceinamonth = c(NA, NA, NA, NA),
doyoueatappleseveryday = c(NA, NA, NA, NA),
doyoueatapplesonceinaweek = c(1L, NA, 1L, NA),
doyoueatapplesonceinamonth = c(NA, NA, NA, NA),
doyoueatpearseveryday = c(NA, 1L, NA, NA),
doyoueatpearsonceinaweek = c(NA, NA, NA, 1L),
doyoueatpearsonceinamonth = c(1L, NA, NA, NA)),
.Names = c("ID", "AgeC", "GenderC", "doyoubuyappleseveryday",
"doyoubuyapplesonceinaweek", "doyoubuyapplesonceinamonth", "doyoubuypearseveryday",
"doyoubuypearsonceinaweek", "doyoubuypearssonceinamonth", "doyoueatappleseveryday",
"doyoueatapplesonceinaweek", "doyoueatapplesonceinamonth", "doyoueatpearseveryday",
"doyoueatpearsonceinaweek", "doyoueatpearsonceinamonth"),
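row.names = c(NA, -4L), class = c("data.table", "data.frame"))
I am trying to reshape the dataset on selected columns, using melt()/dcast().
The idea is that, through a series of melt()/dcast() operations, I can reduce the variables to 6 by creating appropriate new variables that take the categorical values.
In the first pass, I chose to eliminate the columns that have "buy" as the match string using the melt()/dcast() functions, as follows: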
id.cols.buy.m <-names(repro.dt)[-grep("buy",names(repro.dt))]
repro.buy.m<-data.table(melt(repro.dt, id.vars = id.cols.buy.m,
measure.vars = grep("buy",names(repro.dt),value=T ),
na.rm=T,variable.name = "buy.fruit",
value.name = "buy.freq"))
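The resulting dataset drops the entire row for ID = 1236, because all of the "buy" variables for this ID are NA. That is data loss for me, because in the next sequence of reshaping operations, with "eat" as the match string, I intend to use the cleaned-up dataset from the previous melt/dcast.
As follows: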
#create a new variable "fruit"
repro.buy.m$fruit = apply(repro.buy.m, 1, function(u){
bool = sapply(fruit[,1], function(x) grepl(x,u[['buy.fruit']]))
if(any(bool)) fruit[bool] else NA
})
#create a fruit purchase freq column
repro.buy.m$freq.pur= apply(repro.buy.m, 1, function(u){
bool = sapply(freq.levels[,1], function(x) grepl(x, u[['buy.fruit']]))
if(!is.na(u[['buy.freq']])) freq.levels[bool] else NA
})
#now drop the redundant columns that have "buy" as match string--
id.cols.buy.m <- colnames(repro.buy.m)[-grep("buy",names(repro.buy.m))]
f <- as.formula(paste(paste(id.cols.buy.m, collapse = " + "), "~ buy.fruit"))
repro.buy.c<-data.table(dcast(data = repro.buy.m, f,
value.var ="buy.freq",
function(x) length(unique(x))))
repro.buy.c<-repro.buy.c[, which(grepl("buy", colnames(repro.buy.c))):=NULL]
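With the steps above, all of my "buy" variables, 6 of them, are now reduced to 2. However, I lose an ID that does not buy fruit but does eat it. If I don't use na.rm = T, the row is retained, but that raises another problem of duplicate rows when I reshape the variables with "eat" as the match string.
My end goal is to consolidate the categorical variables into a single column named "fruit", with associated columns "freq.pur" and "freq.eat" that hold NA where nothing applies, grouped by ID.
Something like this (this is for one ID):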
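The row loss described here can be reproduced on a much smaller table. The following is only a minimal sketch: the table `toy` and its columns `buy_a` and `buy_b` are hypothetical stand-ins for the real "buy" columns.

```r
library(data.table)

# Hypothetical toy table: ID 2 has NA in every "buy" column
toy <- data.table(ID = 1:2,
                  buy_a = c(1L, NA),
                  buy_b = c(NA_integer_, NA_integer_))

# na.rm = TRUE drops every molten row whose value is NA,
# so an ID with only NA values disappears completely
m_drop <- melt(toy, id.vars = "ID", na.rm = TRUE)

# na.rm = FALSE keeps one molten row per ID and measure column
m_keep <- melt(toy, id.vars = "ID", na.rm = FALSE)

print(unique(m_drop$ID))
print(unique(m_keep$ID))
```

Keeping the NA rows is what preserves the ID, at the cost of several rows per ID that then need to be collapsed.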
structure(list(fruit = c("apples", "pears"), freq.pur = c("onceinaweek",NA),
freq.et = c("onceinaweek", "onceinamonth")),
.Names = c("fruit", "freq.pur", "freq.et"),
row.names = c(NA, -2L), class = c("data.table", "data.frame"),
sorted = "fruit")
I would appreciate any help with this sample dataset.
Best regards
Edit: 04/21, 10:15 IST (mock data table of the expected output)
structure(list(ID = c(1234L, 1234L, 1235L, 1235L, 1236L, 1237L),
AgeC = c("25-30", "25-30", "31+", "31+", "25-30", "20-24"),
GenderC = c("female", "female", "male", "male", "female",
"female"), freq.pur = c(NA, "onceinaweek", "everyday", "everyday",
NA, "onceinaweek"), freq.et = c("onceinamonth", "onceinaweek",
"everyday", "everyday", "onceinaweek", "onceinaweek"), fruit = c("pears",
"apples", "apples", "pears", "apples", "pears")),
.Names = c("ID","AgeC", "GenderC", "freq.pur", "freq.et", "fruit"),
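row.names = c(NA,-6L), class = c("data.table", "data.frame"))
Edit: 04/25
I was able to solve this with a few unique() calls on intermediate structures and a couple of helper variables, and I was able to test all the use cases. My final data table is as follows: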
dput(repro.buy.eat.final)
structure(list(ID = c(1210L, 1210L, 1234L, 1234L, 1234L, 1235L,
1235L, 1237L, 1237L, 1238L, 1238L, 1239L),
AgeC = c("25-30", "25-30", "25-30", "25-30", "25-30", "31+",
"31+", "20-24", "20-24", "25-30", "25-30", "25-30"),
GenderC = c("female", "female", "female", "female", "female", "male",
"male", "female", "female", "male", "male", "male"),
fruit = c("apples", "apples", "apples", "pears", "apples", "apples",
"pears", "pears", "pears", "apples", "pears", "pears"),
freq.et = c("everyday", NA, NA, "onceinamonth", "onceinaweek", NA,
"everyday", "onceinaweek", NA, "onceinamonth", NA, NA),
freq.pur = c(NA, "onceinaweek", "onceinaweek", NA, NA, "everyday", NA,
NA, "onceinaweek", NA, "everyday", "onceinaweek")),
row.names = c(NA, -12L),
class = c("data.table", "data.frame"),
.Names = c("ID", "AgeC", "GenderC", "fruit", "freq.et", "freq.pur"))
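In the result set, I want to merge rows for the same fruit that have matching "buy" and "eat" frequencies. I found a related answer here: R: Merge of rows in same data table, concatenating certain columns. However, I don't know how to apply a condition that matches on frequency, although I can group by ID and fruit. I am looking for help with this. I can share the code snippets here, if that doesn't make this post too long. I also haven't yet got the hang of the formatting styles for attaching a data table in a tabular view.
The test cases I used:
Best regards
Answer 0 (score: 1)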
Here's one way. Assuming your data.table is called dt:
Step 1:
Change the column names in a way that we can split on later:
setnames(dt, gsub("doyou(buy|eat)(apples|pears)", "\\1_\\2_", names(dt)))
\\1 and \\2 capture the values matched inside the parentheses () in the first argument of gsub. Basically, I've added _ where I want it, so that the names are now:
names(dt)
# [1] "ID" "AgeC" "GenderC"
# [4] "buy_apples_everyday" "buy_apples_onceinaweek" "buy_apples_onceinamonth"
# [7] "buy_pears_everyday" "buy_pears_onceinaweek" "buy_pears_sonceinamonth"
# [10] "eat_apples_everyday" "eat_apples_onceinaweek" "eat_apples_onceinamonth"
# [13] "eat_pears_everyday" "eat_pears_onceinaweek" "eat_pears_onceinamonth"
We can now split on _. I recommend using a separator like this for readability.
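To see the backreferences in isolation, here is a minimal sketch in base R. The vector `old` is a hypothetical subset of the column names from the question.

```r
# Hypothetical subset of the column names from the question
old <- c("ID", "doyoubuyappleseveryday", "doyoueatpearsonceinaweek")

# (buy|eat) is capture group 1 and (apples|pears) is capture group 2;
# "\\1_\\2_" writes both groups back, each followed by "_"
new <- gsub("doyou(buy|eat)(apples|pears)", "\\1_\\2_", old)
print(new)

# Names without a match (such as "ID") come back unchanged.
# The separator makes the pieces easy to recover later:
parts <- strsplit(new[2], "_", fixed = TRUE)[[1]]
print(parts)
```

Whatever follows the matched prefix (the frequency part) is left untouched, which is why only two capture groups are needed.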
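Step 2:
Melt your data.table:
dt.m = melt(dt, id=names(dt)[1:3])
Read the warning and try to understand it. It warns because some of your columns are all NA (they are loaded as logical type rather than integer type). That's fine here, because the end result is an integer column, so you can ignore the warning.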
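The type mismatch behind that warning can be checked on a small table. This is a sketch only: `toy` and its columns `a` and `b` are hypothetical, and the warning is suppressed here on purpose so that the example runs quietly.

```r
library(data.table)

# Hypothetical toy table: `a` is integer, `b` holds only NA
toy <- data.table(ID = 1:2, a = c(1L, NA), b = c(NA, NA))

# A column built from bare NA values is logical, not integer
print(sapply(toy, class))

# melt() combines measure columns of different types into one value column
toy.m <- suppressWarnings(melt(toy, id.vars = "ID"))
print(class(toy.m$value))
```

An alternative, if the warning is unwanted, would be to convert the all-NA columns to integer before melting.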
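Step 3:
Split the variable column on _ and create 3 separate columns:
dt.m[, c("buy_eat", "fruit", "freq") := tstrsplit(variable, "_", fixed=TRUE)]
dt.m[, variable := NULL]
:= takes a character vector (of column names) on the LHS and a list of the corresponding values on the RHS. tstrsplit already returns a list. Check the output of tstrsplit(...) separately to understand what it is doing:
tstrsplit(dt.m$variable, "_", fixed=TRUE)
:= NULL deletes the column. We don't need variable any more, so we delete it. When the LHS is a single value we don't necessarily have to wrap it in "", for convenience, i.e. "variable" and variable mean the same thing here.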
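A minimal sketch of what tstrsplit returns and how it feeds :=, using a hypothetical two-element vector `v` in place of the real variable column:

```r
library(data.table)

# Hypothetical values of the molten `variable` column
v <- c("buy_apples_everyday", "eat_pears_onceinaweek")

# tstrsplit() splits each string and transposes the result:
# one list element per position, not one per input string
parts <- tstrsplit(v, "_", fixed = TRUE)
str(parts)

# Each list element becomes one new column, in order
toy <- data.table(variable = v)
toy[, c("buy_eat", "fruit", "freq") := tstrsplit(variable, "_", fixed = TRUE)]
toy[, variable := NULL]
print(toy)
```

Because the list has exactly three elements, it lines up with the three names on the left-hand side.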
Step 4:
Where value is NA, replace value with 0 and freq with NA:
dt.m[is.na(value), c("value", "freq") := list(0, NA)]
Step 5:
For each ID, buy_eat, fruit, extract the row corresponding to the maximum value:
ans = dt.m[, .SD[which.max(value)], by=.(ID, buy_eat, fruit)]
# ID buy_eat fruit AgeC GenderC value freq
# 1: 1234 buy apples 25-30 female 1 onceinaweek
# 2: 1235 buy apples 31+ male 1 everyday
# 3: 1236 buy apples 25-30 female 0 NA
# 4: 1237 buy apples 20-24 female 0 NA
# 5: 1234 buy pears 25-30 female 0 NA
# 6: 1235 buy pears 31+ male 0 NA
# 7: 1236 buy pears 25-30 female 0 NA
# 8: 1237 buy pears 20-24 female 1 onceinaweek
# 9: 1234 eat apples 25-30 female 1 onceinaweek
# 10: 1235 eat apples 31+ male 0 NA
# 11: 1236 eat apples 25-30 female 1 onceinaweek
# 12: 1237 eat apples 20-24 female 0 NA
# 13: 1234 eat pears 25-30 female 1 onceinamonth
# 14: 1235 eat pears 31+ male 1 everyday
# 15: 1236 eat pears 25-30 female 0 NA
# 16: 1237 eat pears 20-24 female 1 onceinaweek
which.max(<all_NA_values>) returns integer(0) (an integer of length 0), which is undesirable. That is why we replaced NA in value with 0 in the previous step.
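The behaviour of which.max on an all-NA vector is easy to confirm in base R. A minimal sketch, with hypothetical vectors `x_all_na` and `x_filled`:

```r
# A group in which every value is NA
x_all_na <- c(NA_integer_, NA_integer_)

# which.max() ignores NA, so nothing is left to pick from
idx <- which.max(x_all_na)
print(length(idx))

# After replacing NA with 0 there is always a maximum;
# ties are resolved by taking the first position
x_filled <- ifelse(is.na(x_all_na), 0L, x_all_na)
print(which.max(x_filled))
```

Indexing .SD with a zero-length index returns no rows, so without the replacement those groups would drop out of the result.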
Final step:
dcast it.
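I think this is the result you are looking for. If not, I think this should give you an idea of how to approach the problem. I'll leave the rest of the tinkering to you.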
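A minimal sketch of what that final dcast could look like. The formula, the choice of freq as value.var, and the renaming to freq.pur and freq.et are assumptions based on the expected output in the question; the table below is a hypothetical slice of ans (two IDs, pears only).

```r
library(data.table)

# Hypothetical slice of `ans` from step 5
ans <- data.table(
  ID      = c(1234L, 1237L, 1234L, 1237L),
  buy_eat = c("buy", "buy", "eat", "eat"),
  fruit   = "pears",
  AgeC    = c("25-30", "20-24", "25-30", "20-24"),
  GenderC = "female",
  value   = c(0, 1, 1, 1),
  freq    = c(NA, "onceinaweek", "onceinamonth", "onceinaweek")
)

# One row per ID and fruit; the buy/eat frequencies become two columns
res <- dcast(ans, ID + AgeC + GenderC + fruit ~ buy_eat, value.var = "freq")

# Assumed renaming, to match the column names used in the question
setnames(res, c("buy", "eat"), c("freq.pur", "freq.et"))
print(res)
```

Since step 5 leaves exactly one row per ID, buy_eat and fruit, each cell receives a single value and no aggregation function is needed.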