Melting selected columns with na.rm = T drops entire data rows, even though the remaining columns hold valid data

Asked: 2016-04-20 10:56:35

Tags: r data.table reshape

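I am a beginner in R and a big fan of the data.table package. This is probably my first R question on the forum, so apologies in advance for any badly formatted code or text.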

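I happen to work with datasets that contain a large number of categorical variables. I have tried to create a mock dataset to explain the problem: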

structure(list(ID = 1234:1237, 
AgeC = c("25-30", "31+", "25-30", "20-24"), 
GenderC = c("female", "male", "female", "female"), 
doyoubuyappleseveryday = c(NA, 1L, NA, NA), 
doyoubuyapplesonceinaweek = c(1L, NA, NA, NA), 
doyoubuyapplesonceinamonth = c(NA, NA, NA, NA), 
doyoubuypearseveryday = c(NA, NA, NA, NA), 
doyoubuypearsonceinaweek = c(NA, NA, NA, 1L), 
doyoubuypearssonceinamonth = c(NA, NA, NA, NA), 
doyoueatappleseveryday = c(NA, NA, NA, NA), 
doyoueatapplesonceinaweek = c(1L, NA, 1L, NA), 
doyoueatapplesonceinamonth = c(NA, NA, NA, NA), 
doyoueatpearseveryday = c(NA, 1L, NA, NA), 
doyoueatpearsonceinaweek = c(NA, NA, NA, 1L), 
doyoueatpearsonceinamonth = c(1L, NA, NA, NA)), 
.Names = c("ID", "AgeC", "GenderC", "doyoubuyappleseveryday", 
    "doyoubuyapplesonceinaweek", "doyoubuyapplesonceinamonth", "doyoubuypearseveryday", 
    "doyoubuypearsonceinaweek", "doyoubuypearssonceinamonth", "doyoueatappleseveryday", 
    "doyoueatapplesonceinaweek", "doyoueatapplesonceinamonth", "doyoueatpearseveryday", 
    "doyoueatpearsonceinaweek", "doyoueatpearsonceinamonth"), 
    row.names = c(NA, -4L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000380788>)

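I am trying to reshape the dataset on selected columns, using melt()/dcast().

The idea is that, through a series of melt()/dcast() operations, I can reduce the variables to 6 by creating suitable new variables that take categorical values.

In the first pass, I use melt()/dcast() to eliminate the columns that have "buy" as the match string, as follows: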

id.cols.buy.m <-names(repro.dt)[-grep("buy",names(repro.dt))]
repro.buy.m<-data.table(melt(repro.dt, id.vars = id.cols.buy.m, 
                        measure.vars = grep("buy",names(repro.dt),value=T ),
                        na.rm=T,variable.name = "buy.fruit", 
                        value.name = "buy.freq"))

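The resulting dataset drops the entire row for ID = 1236, because all of the "buy" variables for this ID are NA. This is data loss for me, because in the next sequence of reshape operations, with "eat" as the match string, I intend to use the cleaned-up dataset from the previous melt/dcast.

As follows: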

#create a new variable "fruit"
repro.buy.m$fruit = apply(repro.buy.m, 1, function(u){
bool = sapply(fruit[,1], function(x) grepl(x,u[['buy.fruit']]))
if(any(bool)) fruit[bool] else NA
})


#create a fruit purchase freq column 

repro.buy.m$freq.pur= apply(repro.buy.m, 1, function(u){
bool = sapply(freq.levels[,1], function(x) grepl(x, u[['buy.fruit']]))
if(!is.na(u[['buy.freq']])) freq.levels[bool] else NA
})

#now drop the redundant columns that have "buy" as match string--

id.cols.buy.m <- colnames(repro.buy.m)[-grep("buy",names(repro.buy.m))]
f <- as.formula(paste(paste(id.cols.buy.m, collapse = " + "), "~ buy.fruit"))

repro.buy.c<-data.table(dcast(data = repro.buy.m, f, 
                              value.var   ="buy.freq",
                              function(x)   length(unique(x))))

repro.buy.c<-repro.buy.c[, which(grepl("buy", colnames(repro.buy.c))):=NULL]

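With the steps above, all of my "buy" variables, 6 of them, are now reduced to 2. However, I have lost an ID that does not buy fruit but does eat fruit. If I do not use na.rm = T, the row is kept, but then reshaping the variables with "eat" as the match string raises another problem: duplicate rows.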
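The row loss described here can be reproduced on a small hypothetical table (toy, buy_a, buy_b and eat_a are illustrative names, not part of the dataset above). This is a minimal sketch of how na.rm = TRUE removes every melted row for an ID whose selected columns are all NA, while the default keeps one row per ID and measure column:

```r
library(data.table)

# Hypothetical table: ID 3 has no "buy" answers at all, but does have an "eat" answer
toy <- data.table(ID    = 1:3,
                  buy_a = c(1L, NA, NA),
                  buy_b = c(NA, 1L, NA),
                  eat_a = c(NA, NA, 1L))

# na.rm = TRUE removes each melted row whose value is NA
with_rm <- melt(toy, id.vars = c("ID", "eat_a"),
                measure.vars = c("buy_a", "buy_b"), na.rm = TRUE)

# The default (na.rm = FALSE) keeps every ID x measure combination
without_rm <- melt(toy, id.vars = c("ID", "eat_a"),
                   measure.vars = c("buy_a", "buy_b"))

lost_ids <- setdiff(toy$ID, with_rm$ID)  # IDs that vanished entirely
lost_ids
nrow(without_rm)
```

Keeping the NA rows at melt time and filtering later, once the "buy" and "eat" information sit side by side, avoids losing such IDs.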

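My end goal is to consolidate the categorical variables into a single column called "fruit", with associated columns "freq.pur" and "freq.eat" (NA where there is no answer), grouped by ID.

Something like this (this is for one ID):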

structure(list(fruit = c("apples", "pears"), freq.pur = c("onceinaweek",NA),
freq.et = c("onceinaweek", "onceinamonth")), 
.Names = c("fruit", "freq.pur", "freq.et"), 
row.names = c(NA, -2L), class = c("data.table", "data.frame"),
.internal.selfref = <pointer: 0x0000000000380788>, sorted = "fruit")

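Using this sample dataset, I would appreciate any help on:

  1. How to drop NA values for selected columns during partial, sequential reshaping of the data, so that the reshaping does not produce duplicate rows
  2. How to merge columns of different lengths within the same data table, joined by a single categorical variable with unique values ("fruit" in this case) and grouped by ID, again dropping the meaningless rows

Best regards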

Edit: 04/21 10:15 IST (mock data table of the expected output)

    structure(list(ID = c(1234L, 1234L, 1235L, 1235L, 1236L, 1237L), 
    AgeC = c("25-30", "25-30", "31+", "31+", "25-30", "20-24"), 
    GenderC = c("female", "female", "male", "male", "female", 
    "female"), freq.pur = c(NA, "onceinaweek", "everyday", "everyday", 
    NA, "onceinaweek"), freq.et = c("onceinamonth", "onceinaweek", 
    "everyday", "everyday", "onceinaweek", "onceinaweek"), fruit = c("pears", 
    "apples", "apples", "pears", "apples", "pears")), 
    .Names = c("ID","AgeC", "GenderC", "freq.pur", "freq.et", "fruit"),
    row.names = c(NA,-6L), class = c("data.table", "data.frame"))

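Edit: 04/25

I was able to solve this with a few "unique" calls on intermediate structures plus several helper variables, and I was able to test all the use cases. My final data table looks like this: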

    dput(repro.buy.eat.final)
    
    structure(list(ID = c(1210L, 1210L, 1234L, 1234L, 1234L, 1235L, 
    1235L, 1237L, 1237L, 1238L, 1238L, 1239L), 
    AgeC = c("25-30", "25-30", "25-30", "25-30", "25-30", "31+", 
    "31+", "20-24", "20-24", "25-30", "25-30", "25-30"), 
    GenderC = c("female", "female", "female", "female", "female", "male", 
    "male", "female", "female", "male", "male", "male"), 
    fruit = c("apples", "apples", "apples", "pears", "apples", "apples",
    "pears", "pears", "pears", "apples", "pears", "pears"), 
    freq.et = c("everyday", NA, NA, "onceinamonth", "onceinaweek", NA,
    "everyday", "onceinaweek", NA, "onceinamonth", NA, NA), 
    freq.pur = c(NA, "onceinaweek", "onceinaweek", NA, NA, "everyday", NA, 
    NA,  "onceinaweek", NA, "everyday", "onceinaweek")), 
    row.names = c(NA, -12L), 
    class = c("data.table", "data.frame"),
    .internal.selfref = <pointer: 0x0000000002830788>, 
    .Names = c("ID", "AgeC", "GenderC", "fruit", "freq.et", "freq.pur"))
    

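In the result set, I would like to merge rows for the same fruit that have matching "buy" and "eat" frequencies. I found a related answer here: R: Merge of rows in same data table, concatenating certain columns. However, I do not know how to apply a condition to match the frequencies, although I can group by ID and fruit. This is where I am asking for help. I could share the code snippets here, if the post had not already become so long. I also have not yet got the hang of the formatting needed to attach a data table in a table view.

Test cases I used:

1. Does not buy fruit, but eats fruit
2. Does not eat fruit, but buys fruit
3. Buys and eats the same fruit at the same frequency
4. Buys and eats the same fruit at different frequencies
5. Buys and eats different fruits at the same frequency
6. Buys and eats different fruits at different frequencies

Best regards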

1 answer:

Answer 0 (score: 1)

Here is one way. Suppose your data.table is called dt.

Step 1:

Change the column names in a way that lets us split them later:

setnames(dt, gsub("doyou(buy|eat)(apples|pears)", "\\1_\\2_", names(dt)))

\\1 and \\2 refer to the values captured by the parentheses () in the first argument of gsub. Basically, I have added "_" where I want it, so that the names are now:

names(dt)
#  [1] "ID"                      "AgeC"                    "GenderC"
#  [4] "buy_apples_everyday"     "buy_apples_onceinaweek"  "buy_apples_onceinamonth"
#  [7] "buy_pears_everyday"      "buy_pears_onceinaweek"   "buy_pears_sonceinamonth"
# [10] "eat_apples_everyday"     "eat_apples_onceinaweek"  "eat_apples_onceinamonth"
# [13] "eat_pears_everyday"      "eat_pears_onceinaweek"   "eat_pears_onceinamonth"

Now we can split on "_". I suggest using a separator like this for readability.
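To see the backreferences at work in isolation, here is a small sketch on a hypothetical vector holding three of the dataset's column names (old, new and pieces are illustrative variable names):

```r
old <- c("ID", "doyoubuyappleseveryday", "doyoueatpearsonceinaweek")

# (buy|eat) is capture group 1 and (apples|pears) is capture group 2;
# the replacement "\\1_\\2_" re-emits both groups, each followed by "_"
new <- gsub("doyou(buy|eat)(apples|pears)", "\\1_\\2_", old)
new
# "ID" "buy_apples_everyday" "eat_pears_onceinaweek"

# Names without a match, such as "ID", pass through unchanged.
# The separator makes the later split trivial:
pieces <- strsplit(new[2], "_", fixed = TRUE)[[1]]
pieces
# "buy" "apples" "everyday"
```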

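Step 2:

Melt your data.table:

dt.m = melt(dt, id=names(dt)[1:3])

Read the warning and try to understand it. melt warns because some of your columns are entirely NA (they were loaded as logical type rather than integer type). That is fine here, because the resulting value column is integer. So you can ignore the warning.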
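Why an all-NA column ends up as logical can be checked in base R. The vectors below are a small illustration with hypothetical names (all_na, mixed, typed), independent of the dataset:

```r
all_na <- c(NA, NA, NA, NA)      # a bare NA is a logical constant
mixed  <- c(NA, 1L, NA, NA)      # one integer value makes the whole vector integer
typed  <- rep(NA_integer_, 4)    # an explicitly typed NA keeps the vector integer

class(all_na)
class(mixed)
class(typed)
```

If the mixed types are unwanted, building the all-NA columns with NA_integer_ is one way to give every measure column the same type before melting.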

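Step 3:

Split the variable column on "_" and create 3 separate columns:

dt.m[, c("buy_eat", "fruit", "freq") := tstrsplit(variable, "_", fixed=TRUE)]
dt.m[, variable := NULL]

:= takes a character vector of column names on the LHS and a list of corresponding values on the RHS. tstrsplit already returns a list. Check the output of tstrsplit(...) separately, by running tstrsplit(dt.m$variable, "_", fixed=TRUE), to understand what it is doing.

variable := NULL deletes that column. We no longer need variable, so we remove it. When the LHS is a single column, for convenience we do not have to quote it, i.e. variable and "variable" mean the same thing here.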
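The split-and-assign pattern can be tried on a two-row hypothetical table first (toy and parts are illustrative names). It is a sketch of the same calls on made-up input, not the answer's own output:

```r
library(data.table)

vars <- c("buy_apples_everyday", "eat_pears_onceinaweek")

# tstrsplit splits each string and transposes the result:
# one list element per piece position, each as long as the input
parts <- tstrsplit(vars, "_", fixed = TRUE)
str(parts)

toy <- data.table(variable = vars, value = c(1L, NA))

# The three list elements on the RHS fill the three names on the LHS, in order
toy[, c("buy_eat", "fruit", "freq") := parts]

# Assigning NULL removes the column by reference
toy[, variable := NULL]
names(toy)
```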

Step 4:

Where value is NA, replace value with 0 and freq with NA:

dt.m[is.na(value), c("value", "freq") := list(0, NA)]

Step 5:

For each ID, buy_eat, fruit group, extract the row corresponding to the maximum value.

ans = dt.m[, .SD[which.max(value)], by=.(ID, buy_eat, fruit)]
#       ID buy_eat  fruit  AgeC GenderC value         freq
#  1: 1234     buy apples 25-30  female     1  onceinaweek
#  2: 1235     buy apples   31+    male     1     everyday
#  3: 1236     buy apples 25-30  female     0           NA
#  4: 1237     buy apples 20-24  female     0           NA
#  5: 1234     buy  pears 25-30  female     0           NA
#  6: 1235     buy  pears   31+    male     0           NA
#  7: 1236     buy  pears 25-30  female     0           NA
#  8: 1237     buy  pears 20-24  female     1  onceinaweek
#  9: 1234     eat apples 25-30  female     1  onceinaweek
# 10: 1235     eat apples   31+    male     0           NA
# 11: 1236     eat apples 25-30  female     1  onceinaweek
# 12: 1237     eat apples 20-24  female     0           NA
# 13: 1234     eat  pears 25-30  female     1 onceinamonth
# 14: 1235     eat  pears   31+    male     1     everyday
# 15: 1236     eat  pears 25-30  female     0           NA
# 16: 1237     eat  pears 20-24  female     1  onceinaweek

which.max(<all_NA_values>) returns integer(0) (an integer vector of length 0), which is undesirable. That is why we replaced NA with 0 in value in the previous step.
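The effect of which.max on an all-NA group is easy to check on a hypothetical four-row table (toy, g, before and after are illustrative names). This sketch shows a group silently disappearing, then surviving once NA is replaced by 0:

```r
library(data.table)

empty_idx <- which.max(c(NA, NA))   # no non-NA value, so no index is returned
length(empty_idx)

# Group "b" has only NA values
toy <- data.table(g = c("a", "a", "b", "b"), value = c(NA, 1L, NA, NA))

# .SD[integer(0)] selects zero rows, so group "b" is missing from the result
before <- toy[, .SD[which.max(value)], by = g]

# After replacing NA with 0, which.max picks the first row of group "b"
toy[is.na(value), value := 0L]
after <- toy[, .SD[which.max(value)], by = g]

before
after
```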

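Final step:

dcast it.

I think this is the result you are looking for. If not, I think this should give you an idea of how to approach the problem. I will leave the rest of the tinkering to you.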
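One possible form of that final cast is sketched below. It is an assumption, not the answer's own call: the formula, the toy rows (copied from the ID 1234 rows of ans in Step 5) and the renamed columns freq.pur and freq.et (borrowed from the question's expected output) are illustrative choices.

```r
library(data.table)

# Toy stand-in for ans: the four ID 1234 rows from Step 5
ans_toy <- data.table(
  ID      = rep(1234L, 4),
  buy_eat = c("buy", "buy", "eat", "eat"),
  fruit   = c("apples", "pears", "apples", "pears"),
  freq    = c("onceinaweek", NA, "onceinaweek", "onceinamonth")
)

# One row per ID and fruit; the buy/eat frequencies become two columns
wide <- dcast(ans_toy, ID + fruit ~ buy_eat, value.var = "freq")

# Hypothetical renaming to match the column names used in the question
setnames(wide, c("buy", "eat"), c("freq.pur", "freq.et"))
wide
```

On the full table, AgeC and GenderC would presumably join ID and fruit on the left-hand side of the formula so that they are carried through.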