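How to reshape a dataset by aggregating one column under specific conditions?

Time: 2016-03-17 15:43:23

Tags: r merge dataframe data.table

My dataset is based on disease counts. Many of the variables are categorical, such as WeekSeries, MonthSeries, and YearSeries. These labels indicate the week, month, and year that each disease count belongs to in my time-series data.

The problem I am facing is building another data table that sums the counts by WeekSeries, MonthSeries, and YearSeries. I need my method to decide whether WeekSeries 1 should be coded as TS1 = 1 or TS2 = 1. For example, in the raw data you can see that the third observation does not belong to TS1 but to TS2, and because it belongs to TS2 it also has HolidaysPerSeason = 10.

I want the method to determine that if the majority of observations in WeekSeries 1 have TS1 = 1 and HolidaysPerSeason = 11, then that becomes the final category for WeekSeries = 1.

Raw data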

 WeekSeries  Counts  TS1  TS2  TS3  TS4  TS5  TS6  HolidaysPerSeason
     1         0      1    0    0    0    0    0          11
     1         1      1    0    0    0    0    0          11
     1         1      0    1    0    0    0    0          10

Ideal format

WeekSeries  Counts  TS1  TS2  TS3  TS4  TS5  TS6  HolidaysPerSeason
     1        2      1    0    0    0    0    0          11

This format is required for building regression models and for other analyses.

Here is fake data similar to my real data:

    library(data.table)  # needed for the data.table() call below
    # a couple of the variables within my data
    JulianDate<-c(10985, 10986,10987)
    DateRcd<-c(NA,NA,"2000-01-31")
    Counts<-c(0,1,1)
    Day<-c("Sat","Sun","Mon")
    Weekend<-c(1,1,0)
    Season<-c(1,1,2)
    HolidaysPerSeason<-c(11,11,10)
    TS1<-c(1,1,0)
    TS2<-c(0,0,1)
    TS3<-c(0,0,0)
    TS4<-c(0,0,0)
    TS5<-c(0,0,0)
    TS6<-c(0,0,0)
    WeekSeries<-c(1,1,1)
    YearSeries<-c(1,1,1)
    MonthSeries<-c(1,1,1)
    mydata<-data.table(JulianDate,DateRcd,Counts,Day,Weekend,Season,HolidaysPerSeason, TS1,TS2,TS3,TS4,TS5,TS6,YearSeries,MonthSeries,WeekSeries) #data simulation

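I tried to use the data.table() function to aggregate by WeekSeries and then merge the result with the original data to build my ideal format for analysis.

My closest attempt so far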

install.packages("data.table")
library(data.table)

DT <- data.table(mydata)
mydata1<-DT[, by = list(WeekSeries)] #doesn't work
mydata2<-DT[,sum(CountsofCholera), by=WeekSeries] #loses all the other variables
idealdata<-merge(mydata2,mydata,by.x=mydata2$WeekSeries) #attempts to regain  the lost variable, this doesn't work because the datasets are not the same length

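What can I do to regain the other categorical variables?

2 Answers:

Answer 0: (score: 4)

This can be optimized in several ways, but it should give you the basic idea: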

# sum up counts and count number of rows with identical values for the last several columns
DT[, .(Count = sum(Counts), .N), by = c(tail(names(DT), -4))][
   # assign same count number = total count to each row within same WeekSeries
   , Count := sum(Count), by = WeekSeries][
   # extract most frequent row (i.e. one with largest N, computed in line 1)
   , .SD[which.max(N)], by = WeekSeries]
#   WeekSeries Weekend Season HolidaysPerSeason TS1 TS2 TS3 TS4 TS5 TS6 YearSeries MonthSeries Count N
#1:          1       1      1                11   1   0   0   0   0   0          1           1     2 2
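
To make the intermediate results easier to inspect, the chained call above can be unrolled into three separate steps. This is only a sketch on a trimmed-down copy of the question's fake data; the names step1 and result are hypothetical, and the grouping columns are listed explicitly instead of being taken from tail(names(DT), -4).

```r
library(data.table)

# Trimmed-down copy of the question's fake data (assumption: only the
# columns needed to illustrate the idea are kept)
DT <- data.table(
  Counts            = c(0, 1, 1),
  HolidaysPerSeason = c(11, 11, 10),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  WeekSeries        = c(1, 1, 1)
)

# Step 1: one row per distinct combination of the categorical columns,
# with the summed counts and the number of source rows (N)
step1 <- DT[, .(Count = sum(Counts), .N),
            by = .(WeekSeries, HolidaysPerSeason, TS1, TS2)]

# Step 2: replace Count with the total for the whole week
step1[, Count := sum(Count), by = WeekSeries]

# Step 3: keep the most frequent combination within each week
result <- step1[, .SD[which.max(N)], by = WeekSeries]
print(result)
```

Note that which.max() returns the first maximum, so if two combinations are equally frequent within a week, the one that appears first is kept.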

Answer 1: (score: 0)

Is group_by what you are looking for? For example, something like this? You should install dplyr and data.table.
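
As a rough sketch of what a group_by approach with dplyr could look like (not necessarily the code this answer had in mind), assuming dplyr 1.0 or later and a trimmed-down copy of the question's fake data. The names result, Count, and N are hypothetical.

```r
library(dplyr)

# Trimmed-down copy of the question's fake data (assumption: only the
# columns needed to illustrate the idea are kept)
mydata <- data.frame(
  Counts            = c(0, 1, 1),
  HolidaysPerSeason = c(11, 11, 10),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  WeekSeries        = c(1, 1, 1)
)

result <- mydata %>%
  # count rows and sum counts per distinct combination of categories
  group_by(WeekSeries, HolidaysPerSeason, TS1, TS2) %>%
  summarise(Count = sum(Counts), N = n(), .groups = "drop") %>%
  # within each week, spread the weekly total to every combination
  group_by(WeekSeries) %>%
  mutate(Count = sum(Count)) %>%
  # keep the most frequent combination (first one wins on ties)
  slice_max(N, n = 1, with_ties = FALSE) %>%
  ungroup()

print(result)
```

Setting with_ties = FALSE guarantees exactly one row per WeekSeries, which matches the single-row-per-week layout the question asks for.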