I have a large table with a lot of data in it, but the only relevant columns are serialNumber and date.
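My goal is to create a new table that gives me, for each serial number, the start date and end date of every unbroken run of consecutive days. Like this: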
serialNumber, minDate, maxDate
1111, 2009-02-15, 2011-07-01
1111, 2014-09-01, 2015-04-12
1111, 2017-12-11, NA
2222, 2016-07-11, 2018-07-01
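By running the snippet below I can get the data I need for one serial number at a time, but I'm stuck trying to get my script to output the data in the format above.
Here is my script: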
library(RMySQL)
library(dplyr)

db <- dbConnect(MySQL(), user = username, password = password,
                dbname = 'database', host = 'host')

# Empty data frame with the target columns
results <- data.frame(serialNumber = numeric(),
                      minDate = as.Date(numeric(), origin = "1970-01-01"),
                      maxDate = as.Date(numeric(), origin = "1970-01-01"))

queryUniqueSerialNumbers <- "SELECT DISTINCT(serialNumber) FROM myTable"
uniqueSerialNumbers <- dbGetQuery(db, queryUniqueSerialNumbers)

getTimeDataForGivenSerialNumber <- function(serialNumber) {
  queryTimeData <- paste0("SELECT * FROM myTable WHERE serialNumber = ", serialNumber)
  timeData <- dbGetQuery(db, queryTimeData)
  # Collapse adjacent duplicate dates
  dateRanges <- as.vector(rle(timeData$date)$values)
  # Start a new group wherever the gap to the previous date is not exactly one day
  unbrokenRuns <- split(as.Date(dateRanges), cumsum(c(TRUE, diff(as.Date(dateRanges)) != 1L)))
  record <- createRecordOfTimeSpan(unbrokenRuns)
  serialNumbers <- as.list(rep(serialNumber, length(results)))
  results <- cbind(serialNumbers, record)
  return(results)
}

createRecordOfTimeSpan <- function(unbrokenRuns) {
  # First and last date of each run
  mins <- lapply(unbrokenRuns, min)
  maxs <- lapply(unbrokenRuns, max)
  record <- data.frame(minDate = mins, maxDate = maxs)
  return(record)
}

results <- as.data.frame(lapply(uniqueSerialNumbers, getTimeDataForGivenSerialNumber))
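
For illustration, here is a minimal base R sketch of the run-detection step on its own, with no database involved. It is one possible way to get one row per run, not a definitive fix for the script above. The sample data and the helper name `summariseRuns` are hypothetical, and the sketch assumes the date column has already been converted with `as.Date`.

```r
# Hypothetical in-memory stand-in for myTable; the dates are made up.
timeData <- data.frame(
  serialNumber = c(1111, 1111, 1111, 1111, 2222, 2222),
  date = as.Date(c("2009-02-15", "2009-02-16", "2009-02-17",
                   "2014-09-01", "2016-07-11", "2016-07-12"))
)

# Hypothetical helper: one row per unbroken run of consecutive days
# for a single serial number.
summariseRuns <- function(serialNumber, dates) {
  dates <- sort(unique(dates))
  # A new run starts wherever the gap to the previous date is not 1 day.
  runId <- cumsum(c(TRUE, diff(as.numeric(dates)) != 1))
  data.frame(
    serialNumber = serialNumber,
    # Dates are sorted, so the first/last element of each run is its min/max.
    minDate = dates[!duplicated(runId)],
    maxDate = dates[!duplicated(runId, fromLast = TRUE)]
  )
}

# Split the dates by serial number, summarise each group, and stack the rows.
datesBySerial <- split(timeData$date, timeData$serialNumber)
pieces <- Map(summariseRuns, as.numeric(names(datesBySerial)), datesBySerial)
results <- do.call(rbind, pieces)
rownames(results) <- NULL
print(results)
```

Returning a small data frame per serial number and stacking the pieces with `rbind` keeps the result in long format (one row per run). This sketch does not try to produce the `NA` end date shown in the desired output for a run that is still open.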