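My dataset is based on disease counts. Many of the variables are categorical, for example WeekSeries, MonthSeries and YearSeries. These labels indicate the week, month and year of my time series to which each disease count belongs.
The problem I am facing is building another data table that sums the counts by WeekSeries, MonthSeries and YearSeries. I need my method to decide whether WeekSeries 1 should be coded as TS1=1 or TS2=1. For example, in the original data you can see that the third observation does not belong to TS1 but to TS2, and because it belongs to TS2 it also has HolidaysPerSeason=10.
I want the method to determine that if the majority of observations in WeekSeries 1 have TS1=1 and HolidaysPerSeason=11, then those become the final categories for WeekSeries=1.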
Original data:
WeekSeries Counts TS1 TS2 TS3 TS4 TS5 TS6 HolidaysPerSeason
         1      0   1   0   0   0   0   0                11
         1      1   1   0   0   0   0   0                11
         1      1   0   1   0   0   0   0                10
Desired output:
WeekSeries Counts TS1 TS2 TS3 TS4 TS5 TS6 HolidaysPerSeason
         1      2   1   0   0   0   0   0                11
This format is required for building regression models and other analyses.
Here is fake data that resembles my real data:
library(data.table) # needed for the data.table() call below
# a couple of the variables within my data
JulianDate<-c(10985, 10986,10987)
DateRcd<-c(NA,NA,"2000-01-31")
Counts<-c(0,1,1)
Day<-c("Sat","Sun","Mon")
Weekend<-c(1,1,0)
Season<-c(1,1,2)
HolidaysPerSeason<-c(11,11,10)
TS1<-c(1,1,0)
TS2<-c(0,0,1)
TS3<-c(0,0,0)
TS4<-c(0,0,0)
TS5<-c(0,0,0)
TS6<-c(0,0,0)
WeekSeries<-c(1,1,1)
YearSeries<-c(1,1,1)
MonthSeries<-c(1,1,1)
mydata<-data.table(JulianDate,DateRcd,Counts,Day,Weekend,Season,HolidaysPerSeason, TS1,TS2,TS3,TS4,TS5,TS6,YearSeries,MonthSeries,WeekSeries) #data simulation
I tried to aggregate by WeekSeries using the data.table() function and then merge the result with the original data to build the format I need for analysis.
install.packages("data.table")
library(data.table)
DT <- data.table(mydata)
mydata1<-DT[, by = list(WeekSeries)] #doesn't work
mydata2<-DT[,sum(Counts), by=WeekSeries] #loses all the other variables (the column is CountsofCholera in my real data)
idealdata<-merge(mydata2,mydata,by.x=mydata2$WeekSeries) #attempts to regain the lost variable, this doesn't work because the datasets are not the same length
What can I do to regain the other categorical variables?
Answer 0 (score: 4)
This can be optimized in several ways, but it should give you the basic idea:
# sum up counts and count number of rows with identical values for the last several columns
DT[, .(Count = sum(Counts), .N), by = c(tail(names(DT), -4))][
# assign same count number = total count to each row within same WeekSeries
, Count := sum(Count), by = WeekSeries][
# extract most frequent row (i.e. one with largest N, computed in line 1)
, .SD[which.max(N)], by = WeekSeries]
#   WeekSeries Weekend Season HolidaysPerSeason TS1 TS2 TS3 TS4 TS5 TS6 YearSeries MonthSeries Count N
#1:          1       1      1                11   1   0   0   0   0   0          1           1     2 2
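Since the question also asks for sums by MonthSeries and YearSeries, the same idea can be wrapped in a small function. The following is only a sketch under assumptions: `summarise_majority`, its arguments and the reduced column set are hypothetical names chosen for illustration, and ties between equally frequent category combinations are resolved by taking the first one.

```r
library(data.table)

# Fake data from the question, reduced to the columns used here
DT <- data.table(
  Counts            = c(0, 1, 1),
  HolidaysPerSeason = c(11, 11, 10),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  WeekSeries        = c(1, 1, 1)
)

# Hypothetical helper: total the counts per group and keep the most
# frequent combination of the categorical columns named in `cats`.
summarise_majority <- function(dt, group, cats) {
  # number of rows for each category combination within each group
  freq <- dt[, .N, by = c(group, cats)]
  # keep the combination with the largest N (first one wins on ties)
  top <- freq[, .SD[which.max(N)], by = c(group)]
  top[, N := NULL]
  # total counts per group
  totals <- dt[, .(Counts = sum(Counts)), by = c(group)]
  merge(totals, top, by = group)
}

res <- summarise_majority(DT, "WeekSeries",
                          c("TS1", "TS2", "HolidaysPerSeason"))
print(res)
```

Calling it again with `"MonthSeries"` or `"YearSeries"` as `group` (on data that contains those columns) would give the monthly and yearly tables in the same format.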
Answer 1 (score: 0)
Is group_by what you are looking for? For example, something like this?
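One way a group_by approach could look is sketched below. This is an illustrative assumption rather than the code this answer originally contained: the names `weekly`, `majority`, `result` and `n_rows` are made up here, only a few columns of the fake data are used, and `slice_max()` needs dplyr 1.0 or later.

```r
library(dplyr)

# Fake data from the question, reduced to the columns used here
mydata <- data.frame(
  Counts            = c(0, 1, 1),
  HolidaysPerSeason = c(11, 11, 10),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  WeekSeries        = c(1, 1, 1)
)

# total counts per week
weekly <- mydata %>%
  group_by(WeekSeries) %>%
  summarise(Counts = sum(Counts), .groups = "drop")

# most frequent combination of the categorical columns per week
majority <- mydata %>%
  count(WeekSeries, TS1, TS2, HolidaysPerSeason, name = "n_rows") %>%
  group_by(WeekSeries) %>%
  slice_max(n_rows, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n_rows)

# one row per week: summed counts plus the majority categories
result <- left_join(weekly, majority, by = "WeekSeries")
print(result)
```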
You should install dplyr and data.table.