I have a transaction database that looks like this:
   AccountID PaymentDate PaymentAmount
8         13  2020-02-09          1.00
9         13  2020-01-25          4.20
10        14  2020-01-01         30.68
11        14  2020-02-01         30.68
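PaymentDate is in POSIX format. For the transaction data I don't want to aggregate by time interval (that is well documented) but by ID.
Calling min() on the POSIX times gives the first day and max() gives the last day. That is the information I need for each ID.
OK, here is what I tried: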
# 1. (summaryBy is from the doBy package)
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length
# 2. (ddply is from the plyr package)
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge
# (English: number of items to replace is not a multiple of replacement length)
# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'
# (English: Error in model.frame.default(...): invalid type (list) for variable 'PaymentDate')
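Apparently aggregation does not work with POSIX times when you want to aggregate the times themselves rather than aggregate by time.
But surely it must be possible to get the first and last transaction date?!
OK, since I can't answer my own question yet, I'll post this here:
Interesting. Thanks!
I usually read the file with read.csv using the as.is = T option and then convert the times with strptime. So when I look at the structure of my data, I get: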
$ PaymentDate : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...
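To me, that doesn't look like a factor. I can call min() and max() on the whole column and it works. Apparently POSIXlt is more troublesome than I thought. Starting from POSIXlt, I did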
data$PaymentDate=as.Date(data$PaymentDate)
Looking at the structure, the class is now correctly set to Date.
$ PaymentDate :Class 'Date' num [1:10000] 18296 18296 18296 18297 18297 ...
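A minimal sketch of why the POSIXlt column is the troublemaker, using hypothetical toy data with the question's column names: a POSIXlt object is stored as a list of components (sec, min, hour, ...), which matches the "invalid type (list)" complaint from aggregate. Converting to POSIXct (or Date, as above) gives one number per time stamp, and the formula interface accepts it.

```r
# Hypothetical toy data; column names follow the question.
d <- data.frame(AccountID = c(13L, 13L, 14L, 14L))
lt <- strptime(c("2020-02-09", "2020-01-25", "2020-01-01", "2020-02-01"),
               format = "%Y-%m-%d", tz = "GMT")

# strptime() returns POSIXlt, which is a list under the hood.
class(lt)
typeof(lt)

# POSIXct stores each time stamp as a single number, so it behaves
# like an ordinary column in aggregate().
d$PaymentDate <- as.POSIXct(lt)
first <- aggregate(PaymentDate ~ AccountID, data = d, FUN = min)
first
```

If only the calendar day matters, as.Date(lt) is an equally valid target class.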
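Now it seems to work. However, only ddply returns the proper format "2020-01-25", while aggregate and summaryBy both return values like "18286". Days since 1970-01-01? Well, I guess I can convert those back.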
foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")
There must be some explanation for this, though. Also, ddply is much slower.
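A short sketch of what is probably going on with the "18286" values. A Date is stored as a number of days since 1970-01-01 plus a class attribute, and any step that simplifies results through unlist() drops that attribute. That this is what happened inside aggregate and summaryBy here is an assumption, since the R and package versions are not stated.

```r
# Day counts as they appear in the question.
days <- c(18286, 18296)
restored <- as.Date(days, origin = "1970-01-01")
format(restored)                 # "2020-01-25" "2020-02-04"

# The underlying number of a Date is visible with unclass().
unclass(as.Date("2020-01-25"))   # 18286

# unlist() drops the Date class and leaves only the day count.
unlist(list(as.Date("2020-01-25")))
```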
Oh, and why do I use strptime first? Well, the dates in the original file are in a different format, "%d-%m-%y". Using as.Date directly on that didn't seem to work.
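For what it is worth, as.Date accepts a format argument, so the strptime detour should not be needed. The sketch below uses a few values copied from the raw data sample in this post; note that those values carry four-digit years, which corresponds to "%Y" rather than "%y".

```r
# Values copied from the raw data sample in this post.
raw <- c("04-02-2010", "25-01-2010", "01-01-2009")

# %d = day, %m = month, %Y = four-digit year.
parsed <- as.Date(raw, format = "%d-%m-%Y")
class(parsed)
format(parsed)   # "2010-02-04" "2010-01-25" "2009-01-01"
min(parsed)
```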
My data (dput output):
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L),
mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L,
3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L,
7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L),
year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L,
110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L,
109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L,
109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L,
109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L,
4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L,
2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L,
0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L,
273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L,
122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID",
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")
The data after following your suggestion:
structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L,
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L,
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277,
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550,
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823,
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386,
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540,
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L,
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L,
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L,
52L, 53L, 54L, 55L, 40L), class = "data.frame")
The raw data:
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010",
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010",
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009",
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009",
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009",
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010",
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009",
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010",
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")
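Answer 0 (score: 1)
The problem lies in your data: specifically, the PaymentDate column is a factor. If you convert the PaymentDate column first, both the ddply and the aggregate solutions work as written: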
#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09",
"2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1,
4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
), class = "data.frame", row.names = c("8", "9", "10", "11"))
Change the class of the variable to Date.
df$PaymentDate <- as.Date(df$PaymentDate)
Then run the original code. Using ddply:
library(plyr)  # provides ddply()
ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01
Using aggregate:
aggregate(PaymentDate ~ AccountID, df, min)
  AccountID PaymentDate
1        13  2020-01-25
2        14  2020-01-01
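Since the question asks for both the first and the last transaction date per ID, here is a minimal base R sketch that combines the two. The data frame is rebuilt from the four sample rows, and the output column names FirstPayment and LastPayment are made up for illustration.

```r
# The four sample rows from the question, with PaymentDate already a Date.
df <- data.frame(
  AccountID   = c(13L, 13L, 14L, 14L),
  PaymentDate = as.Date(c("2020-02-09", "2020-01-25", "2020-01-01", "2020-02-01"))
)

# One aggregate() call per statistic, then join on the ID.
first <- aggregate(PaymentDate ~ AccountID, data = df, FUN = min)
last  <- aggregate(PaymentDate ~ AccountID, data = df, FUN = max)
names(first)[2] <- "FirstPayment"   # hypothetical column names
names(last)[2]  <- "LastPayment"

res <- merge(first, last, by = "AccountID")
res
```

Running two aggregations and merging keeps each column a proper Date, at the cost of splitting the data twice.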
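There is another, more general way to avoid this problem. When you create a data.frame with read.table (or one of its variants such as read.csv), the stringsAsFactors argument defaults to TRUE (this was the default before R 4.0.0; from 4.0.0 on it is FALSE). When I recreate the data with stringsAsFactors=FALSE, the intermediate step of converting PaymentDate is not needed and your code works as written: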
dat <- " AccountID PaymentDate PaymentAmount
8 13 2020-02-09 1.00
9 13 2020-01-25 4.20
10 14 2020-01-01 30.68
11 14 2020-02-01 30.68 "
df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df
ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01
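One caveat, shown as a small sketch: with stringsAsFactors=FALSE the column is plain character, so min() compares strings. That happens to give the right answer for the year-month-day layout used here, but not for a day-month-year layout such as the one in the question's raw file.

```r
# ISO-style strings sort in the same order as the dates they represent.
iso <- c("2020-02-09", "2020-01-25")
min(iso)    # "2020-01-25"

# Day-first strings do not: the smallest string is not the earliest date.
dmy <- c("25-01-2010", "04-02-2010")
min(dmy)    # "04-02-2010"

# Converting to Date first gives the real minimum.
format(min(as.Date(dmy, format = "%d-%m-%Y")))   # "2010-01-25"
```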