新列基于位于其旁边的现有列和列

时间:2017-08-18 15:12:44

标签: r dplyr melt

我的数据框看起来像这样

ID t1 obs1 t2 obs2 t3 obs3
1  0  a    11 d    0  g
2  0  b    13 e    11 i
3  0  c    0  f    0  h

我需要确保每个ID至少有一个t大于10(如果没有则删除行)。然后,我想将最低t值保存在10以上,还要在新列中保存相应的obs。 (关于我的问题的复杂部分是10以上的最低t可以在任何列中)。某些t的相应障碍位于下一列,这样有帮助。所以我的结果数据框看起来像这样:

ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1  0  a    11 d    0  g    11             d
2  0  b    13 e    11 i    11             i

4 个答案:

答案 0 :(得分:4)

使用data.table,转到长格式:

import * as mongoose from 'mongoose';
mongoose.connect('mongodb://localhost:27017/TestThis');
const db = mongoose.connection;

db.on('error', console.error.bind(console, 'connection error:'));
db.once('open', async () => {
    const userRoles = new mongoose.Schema({
        roleName: String,
        user: {type: mongoose.Schema.Types.ObjectId, ref: 'User'}
    });
    const UserRoleModel = mongoose.model('UserRole', userRoles, 'UserRole');

    const userSchema = new mongoose.Schema({
        name: String,
        roles: [{type: mongoose.Schema.Types.ObjectId, ref: 'UserRole'}]
    });
    const UserModel = mongoose.model('User', userSchema, 'User');

    await UserModel.remove({});
    await UserRoleModel.remove({});

    const user = new UserModel({
        name: 'User1'
    });

    const savedUser = await user.save();

    const role = new UserRoleModel({
        roleName: 'Test',
        user: savedUser._id
    });

    await role.save();

    const users = await UserModel.find().populate('roles').exec();
    // Does not populate roles
    console.log('Users', users);

    const roles = await UserRoleModel.find().populate('user').exec();
    // Populates the user
    console.log('Roles', roles);

    process.exit();
});

查找每组的最大值并标记要保留的组:

library(data.table)
setDT(DT)
dat = melt(DT, measure.vars = patterns("^t\\d+$", "^obs\\d+$"), value.name = c("t", "obs"))
setorder(dat, ID, variable)

#    ID variable  t obs
# 1:  1        1  0   a
# 2:  1        2 11   d
# 3:  1        3  0   g
# 4:  2        1  0   b
# 5:  2        2 13   e
# 6:  2        3 11   i
# 7:  3        1  0   c
# 8:  3        2  0   f
# 9:  3        3  0   h

使用滚动更新连接查找每个保留组的最小值超过10:

IDDT = dat[order(-t), 
  .(max.variable = first(variable), max.t = first(t), max.obs = first(obs))
, by=ID]
IDDT[, keep := max.t > 10]

#    ID max.variable max.t max.obs  keep
# 1:  2            2    13       e  TRUE
# 2:  1            2    11       d  TRUE
# 3:  3            1     0       c FALSE

我会停在这里,主要数据为长格式IDDT[(keep), c("my.variable", "my.t", "my.obs") := { m = .(ID = ID, t_thresh = 10) dat[m, on=.(ID, t = t_thresh), roll=-Inf, .(x.variable, x.t, x.obs)] }] # ID max.variable max.t max.obs keep my.variable my.t my.obs # 1: 2 2 13 e TRUE 3 11 i # 2: 1 2 11 d TRUE 2 11 d # 3: 3 1 0 c FALSE NA NA NA dat级变量位于单独的表ID中。要将IDDT过滤到应保留的群组:dat。有关语法的详细信息,请参阅dat[IDDT[(keep), .(ID)], on=.(ID)]以及加载程序包时提到的其他介绍材料。

如果您坚持要回到广泛范围,请参阅?data.table

答案 1 :(得分:3)

使用基数R:

删除所有没有超过10的t值的行:

df1 <- df1[rowSums(df1[, grepl("^t", colnames(df1))] >10) > 0, ]

确定包含10以上最低值的组,然后检索值:

df1$group <- apply(df1[grepl("^t", names(df1))], 1, function(x) which(x == min(x[x > 10])))
df1 <- cbind(df1, do.call(rbind, lapply(seq_len(nrow(df1)), 
                                        function(x) setNames(df1[x, paste0(c("t", "obs"), df1$group[x])],
                                                             c("lowesttabove10", "correspondingobs")))))

> df1
  ID t1 obs1 t2 obs2 t3 obs3 group lowesttabove10 correspondingobs
1  1  0    a 11    d  0    g     2             11                d
2  2  0    b 13    e 11    i     3             11                i

答案 2 :(得分:2)

我的方法并不整洁,但仍然有效,你可以尝试一下。

library(dplyr)
library(reshape)
df1=melt(df,id='ID')

df2=df1%>%group_by(ID)%>%filter(value>10)%>%dplyr::slice(which.min(value))%>%na.omit()

> df2
# A tibble: 2 x 3
# Groups:   ID [2]
     ID variable value
  <int>   <fctr> <chr>
1     1       t2    11
2     2       t3    11


df2$variable=as.character(df2$variable)
C=as.numeric(gsub("[[:alpha:]]", "", df2$variable))
df=df[df$ID%in%df2$ID,]
for (i in 1:length(C)){
DF1=df[i,str_detect(names(df),as.character(C[i]))]
names(DF1)=c('lowesttabove10 ','correspondingobs')
if (i ==1 ){DFF=DF1}else{DFF=rbind(DFF,DF1)}
}
cbind(df,DFF)

  ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10  correspondingobs
1  1  0    a 11    d  0    g              11                d
2  2  0    b 13    e 11    i              11                i

答案 3 :(得分:2)

解决方案在一个管道中使用dplyrtidyrdt是原始数据,而dt2是最终输出。

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  gather(t_group, t_value, starts_with("t")) %>%
  gather(obs_group, obs_value, starts_with("obs")) %>%
  filter(gsub("t", "", t_group) == gsub("obs", "", obs_group)) %>%
  filter(t_value >= 10) %>%
  filter(t_value == min(t_value)) %>%
  select(ID, lowesttabove10 = t_value, correspondingobs = obs_value) %>%
  inner_join(dt, by = "ID") %>%
  select(colnames(dt), lowesttabove10, correspondingobs)

df2
  ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1  1  0    a 11    d  0    g             11                d
2  2  0    b 13    e 11    i             11                i

数据:

dt <- read.table(text = "ID t1 obs1 t2 obs2 t3 obs3
1  0  a    11 d    0  g
                 2  0  b    13 e    11 i
                 3  0  c    0  f    0  h",
                 header = TRUE, stringsAsFactors = FALSE)