Subsetting a data frame with ddply, then applying an adply function to each subset (R)

Date: 2014-08-27 23:54:32

Tags: r plyr

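I am having some trouble working out the logic of some plyr code. My problem involves two large data frames of different lengths; samples of each are below: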

dfSample <-
 structure(list(Type = structure(c(8L, 100L, 86L, 86L, 86L, 86L, 
 33L, 8L, 105L, 44L, 36L, 107L, 107L, 78L, 33L, 105L, 99L, 10L, 
 16L, 75L), .Label = c("Alumni Services", "Anti-Virus and Malware", 
 "Application Integration", "Application Monitoring", "Application Testing", 
 "Audio Visual Support", "Audio Visual Support - CLS", "Audio Visual Support - Non-CLS", 
 "Backup Services", "Banner", "Bus and Law", "Business Analysis", 
 "Careers", "Common Learning Spaces", "Communication and Marketing", 
 "Computer Aided Assessment", "Conference Accounts", "Content Management", 
 "Database Services", "Datacentre", "Desktop Monitoring", "Desktop Software", 
 "Document Management", "Email", "Email Programs", "Encryption", 
 "Eng and the Enviro", "Equipment Disposal", "Estates and Facilities", 
 "Examination Papers", "Faculty Engagement", "Filestore Support Services", 
 "Finance Services", "General Admin Services", "General InfoSec Advice", 
 "Generic Accounts", "Grid Accounts (HPC)", "Health Sciences", 
 "High Performance Computing (HPC)", "Hosted webspace (LAMP/IIS)", 
 "HR and Payroll Services", "HR General", "HR Recruitment", "HR Systems", 
 "Hub Rooms", "Humanities", "ICT Facilities", "ID Card Services", 
 "Identity Management (User accounts)", "Identity Services", "Information Policy Breaches", 
 "Information Risk Analysis", "iSolutions Admin Services", "iSolutions Administration", 
 "IT Training and Development", "Large File Transfer", "Lecture Capture", 
 "Lecture Capture - CLS", "Lecture Capture - Non-CLS", "Legacy Corporate Systems", 
 "Library Services", "Licence Management", "Managed Print Service", 
 "Management Servers", "Media Asset Management", "Media Support", 
 "Medicine", "Meet and Greet", "Misuse and Security Incidents", 
 "Misuse Of Systems", "Mobile Apps", "Mobile Devices", "Natural and Enviro Sci", 
 "Network Access Services", "Network Services", "OS Builds", "Other Learning Systems", 
 "Personal Filestore", "Personal web pages", "Phys and Applied", 
 "Printing (Managed)", "Printing (Not MPS)", "Project Management and Resourcing", 
 "Repair", "Reporting Services", "Request for Software", "Research Filestore", 
 "Research Governance", "Research Management", "Research Output", 
  "Resource Filestore", "Risk Analysis and Assessment", "Security", 
 "Self Service Help", "Server Monitoring", "Service Hosting", 
 "ServiceLine", "Soc and Human Sci", "Software Configuration Management", 
 "Software Licensing and Management", "Software Services", "SportRec", 
 "Staff Accounts", "Staff Desktop Deployment", "Staff Desktop Services", 
 "Staff Desktop Services (Not UoS Build)", "Student Accounts", 
 "Student Admin Services", "Student Personal Workstations", "SUSSED", 
 "Switchboard", "Switchboard Infrastructure", "System Access Request", 
 "Telephony", "University Admin Services", "Unmanaged Printing", 
 "Videoconferencing", "Videoconferencing - CLS", "Videoconferencing - Non-CLS", 
 "Virtual Learning Environment (VLE)", "Visitor Accounts", "Web Statistics", 
 "Windows Core Environment"), class = "factor"), Tkt.Category = structure(c(19L, 
 17L, 17L, 17L, 17L, 17L, 2L, 19L, 5L, 2L, 9L, 9L, 9L, 4L, 2L, 
 5L, 20L, 2L, 19L, 20L), .Label = c("Communication and Collaboration", 
 "Corporate Services", "Data Centre", "Data Storage Services", 
 "Desktop IT", "Faculty IT", "Help Services", "HR", "Identity Management (User accounts)", 
 "Information Security", "Logistics", "Programmes and Projects", 
 "Quality and Testing", "Research Services", "Security", "SLO Corporate Services", 
 "Software", "Standard", "Teaching Services", "Underpinning Services", 
 "Web Services"), class = "factor"), `CreateDateTime` = structure(c(1370087940, 
 1370156160, 1370162340, 1370178840, 1370190000, 1370240400, 1370242920, 
 1370243040, 1370243040, 1370243280, 1370243280, 1370243520, 1370243580, 
 1370243880, 1370243880, 1370244000, 1370244120, 1370244240, 1370244300, 
 1370244360), class = c("POSIXct", "POSIXt")), `ClosingDateTime` = structure(c(1374501300, 
 1372068300, 1379062020, 1390487100, 1379062080, 1375090560, 1373984760, 
 1370856420, 1370440140, 1370508240, 1370338080, 1370243820, 1370243700, 
 1370255520, 1370341440, 1370248680, 1370353560, 1370338800, 1370257140, 
 1374222600), class = c("POSIXct", "POSIXt"))), .Names = c("Type", 
 "Tkt.Category", "CreateDateTime", "ClosingDateTime"
 ), row.names = c(NA, 20L), class = "data.frame")

DF2<-
 structure(list(DateTime = structure(c(1370041200, 1370052000, 
 1370062800, 1370073600, 1370084400, 1370095200, 1370106000, 1370116800, 
 1370127600, 1370138400, 1370149200, 1370160000, 1370170800, 1370181600, 
 1370192400, 1370203200, 1370214000, 1370224800, 1370235600, 1370246400
 ), class = c("POSIXct", "POSIXt"))), .Names = "DateTime", row.names = c(NA, 
 20L), class = "data.frame")

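I am trying to get the size (number of rows) of subsets of dfSample, based on conditions that involve the DF2 data, separately for each Tkt.Category, like this: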

library(plyr)

QCalc <- function(m) {
  adply(DF2, 1, transform, q=as.character(
                               nrow(subset(m, CreateDateTime <= DateTime & 
                                              ClosingDateTime >= DateTime))))
}

ServiceQueue <- ddply(dfSample, .(Tkt.Category), QCalc)

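This does not seem to work, so I guess there must be something wrong with how I wrote the function for the ddply part, because the following snippet works when I use all of the data (that is, without grouping by Tkt.Category):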

Q <- adply(DF2, 1, transform, q=as.character(
                                   nrow(subset(dfSample, CreateDateTime <= DateTime &
                                                         ClosingDateTime >= DateTime))))

With ddply, the error message I get is that object 'm' cannot be found. Can anyone point me in the right direction to fix this?

1 Answer:

Answer 0: (score: 0)

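If we restate your problem, I think there is a simpler way to solve it. For each ticket category and each timestamp in your list, you want to count the tickets in that category that were created at or before that timestamp and closed at or after it. In SQL we would write something like: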

SELECT Tkt.Category, DateTime, count(*)
FROM dfSample join DF2 on
CreateDateTime<= DateTime 
and ClosingDateTime>= DateTime
GROUP BY Tkt.Category, DateTime

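But this is not SQL (although it probably should be; are you pulling this data from a relational database?), it is R, and base R does not let us merge on inequalities. So we can do a little trick with merge and avoid plyr altogether: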

dfSample$id <- rownames(dfSample)
DFc <- merge(dfSample,DF2)
DFlimited <- DFc[DFc$CreateDateTime <= DFc$DateTime & DFc$ClosingDateTime >= DFc$DateTime,]
DFagg <- aggregate(id ~ Tkt.Category + DateTime, data = DFlimited, length)
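To make the trick easier to follow, here is a minimal sketch of the same three steps on made-up data. The names `tickets`, `stamps`, `all_pairs`, `open_pairs` and `counts`, and all the values, are hypothetical and only for illustration. Because the two data frames share no column names, `merge()` returns every ticket/timestamp pair, and the inequality filter then keeps the pairs where the ticket was open.

```r
# Hypothetical toy data: three tickets in two categories.
tickets <- data.frame(
  Tkt.Category    = c("A", "A", "B"),
  CreateDateTime  = as.POSIXct(c("2013-06-01 08:00", "2013-06-01 10:00",
                                 "2013-06-01 09:00"), tz = "UTC"),
  ClosingDateTime = as.POSIXct(c("2013-06-01 12:00", "2013-06-01 11:00",
                                 "2013-06-01 09:30"), tz = "UTC"),
  stringsAsFactors = FALSE
)
# Hypothetical timestamps at which we want to count open tickets.
stamps <- data.frame(
  DateTime = as.POSIXct(c("2013-06-01 09:15", "2013-06-01 10:30"), tz = "UTC")
)

tickets$id <- rownames(tickets)

# No common columns, so merge() pairs every ticket with every timestamp.
all_pairs <- merge(tickets, stamps)

# Keep only the pairs where the ticket was open at the timestamp.
open_pairs <- all_pairs[all_pairs$CreateDateTime <= all_pairs$DateTime &
                        all_pairs$ClosingDateTime >= all_pairs$DateTime, ]

# Count the remaining rows per category and timestamp.
counts <- aggregate(id ~ Tkt.Category + DateTime, data = open_pairs, FUN = length)
```

One thing to be aware of: a category/timestamp combination with no open tickets has no rows left after filtering, so it does not appear in the aggregated result at all, whereas the adply version in the question would report it as zero.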

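This can be very slow, depending on the size of your tables, because it essentially builds the full cross join (every row of dfSample paired with every row of DF2) and then filters. If you find that is the case, take a look at the data.table package; see this Stack Overflow question for more information.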
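As a rough sketch of what the data.table route could look like, the example below uses a non-equi join, which matches on the inequalities directly instead of materialising every pair first. It assumes a data.table version that supports non-equi joins in `on=` and `nomatch = NULL`; the toy tables `tickets` and `stamps` and the result names are hypothetical, and this is not taken from the linked question.

```r
library(data.table)

# Hypothetical toy data: three tickets in two categories.
tickets <- data.table(
  Tkt.Category    = c("A", "A", "B"),
  CreateDateTime  = as.POSIXct(c("2013-06-01 08:00", "2013-06-01 10:00",
                                 "2013-06-01 09:00"), tz = "UTC"),
  ClosingDateTime = as.POSIXct(c("2013-06-01 12:00", "2013-06-01 11:00",
                                 "2013-06-01 09:30"), tz = "UTC")
)
# Hypothetical timestamps at which we want to count open tickets.
stamps <- data.table(
  DateTime = as.POSIXct(c("2013-06-01 09:15", "2013-06-01 10:30"), tz = "UTC")
)

# Non-equi join: for each timestamp, find tickets with
# CreateDateTime <= DateTime and ClosingDateTime >= DateTime.
# nomatch = NULL drops timestamps that match no ticket.
open_pairs <- tickets[stamps,
  on = .(CreateDateTime <= DateTime, ClosingDateTime >= DateTime),
  nomatch = NULL,
  .(Tkt.Category, DateTime = i.DateTime)]

# Count open tickets per category and timestamp.
counts <- open_pairs[, .N, by = .(Tkt.Category, DateTime)]
```

The `i.` prefix is used to pull the timestamp from `stamps` explicitly, since the join columns in a non-equi join result can otherwise be confusing to read. As with the merge version, combinations with no open tickets are simply absent from `counts`.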