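I need to return the n most frequently occurring strings, using a multi-row data frame as input. All of the values are in a single column named "MissingDates".
Here is some sample data; there are about 5,000 rows in total: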
Symbol Count MissingDates
AD 27 1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11-29, 1996-12-09, 1996-12-20, 1996-12-23, 1996-12-26, 1996-12-27, 1997-01-02, 1997-05-02, 1997-09-10, 1998-01-02, 1998-04-16, 1998-12-08, 1999-12-27, 1999-12-31, 2001-09-12, 2003-08-06, 2003-10-13
BP 14 1995-08-09, 1995-08-15, 1995-12-26, 1996-01-02, 1996-09-06, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-09-12, 2002-12-24, 2003-08-06, 2003-10-13
C 3 1999-12-31, 2001-12-24, 2002-12-24
CC 285 1994-05-18, 1994-05-19, 1994-05-20, 1994-05-23, 1994-05-24, 1994-05-25, 1994-05-26, 1994-05-27, 1994-05-31, 1994-06-01, 1994-06-02, 1994-06-03, 1994-06-06, 1994-06-07, 1994-06-08, 1994-06-09, 1994-06-10, 1994-06-13, 1994-06-14, 1994-06-15, 1994-06-16, 1994-06-17, 1994-06-20, 1994-06-21, 1994-06-23, 1994-06-24, 1994-06-27, 1994-06-28, 1994-06-29, 1994-06-30, 1994-07-01, 1994-07-06, 1994-07-14, 1994-07-15, 1994-07-18, 1994-07-19, 1994-07-21, 1994-07-25, 1994-07-27, 1994-07-28, 1994-08-03, 1994-08-04, 1994-08-08, 1994-08-09, 1994-08-10, 1994-08-11, 1994-08-12, 1994-08-15, 1994-08-17, 1994-08-18, 1994-08-19, 1994-08-22, 1994-08-23, 1994-08-24, 1994-08-25, 1994-08-29, 1994-08-31, 1994-09-01, 1994-09-02, 1994-09-06, 1994-09-07, 1994-09-08, 1994-09-09, 1994-09-12, 1994-09-13, 1994-09-15, 1994-09-16, 1994-09-19, 1994-09-20, 1994-09-21, 1994-09-22, 1994-09-23, 1994-09-27, 1994-09-28, 1994-09-29, 1994-09-30, 1994-10-03, 1994-10-04, 1994-10-06, 1994-10-14, 1994-10-18, 1994-10-19, 1994-10-25, 1994-10-26, 1994-10-27, 1994-10-28, 1994-10-31, 1994-11-01, 1994-11-09, 1994-11-10, 1994-11-11, 1994-11-16, 1994-11-17, 1994-11-25, 1994-11-28, 1994-12-01, 1994-12-02, 1994-12-06, 1994-12-07, 1994-12-08, 1994-12-09, 1994-12-12, 1994-12-13, 1994-12-14, 1994-12-15, 1994-12-16, 1994-12-23, 1994-12-27, 1994-12-29, 1994-12-30, 1995-01-03, 1995-01-05, 1995-01-09, 1995-01-11, 1995-01-13, 1995-01-16, 1995-01-17, 1995-01-18, 1995-01-19, 1995-01-20, 1995-01-24, 1995-01-25, 1995-02-13, 1995-02-17, 1995-05-01, 1995-07-03, 1995-11-24, 1995-12-26, 1996-01-08, 1996-01-09, 1996-07-05, 1996-11-29, 1996-12-26, 1997-11-28, 1997-12-26, 1998-01-02, 1998-11-27, 1999-06-17, 1999-06-18, 1999-06-21, 1999-06-22, 1999-06-23, 1999-06-24, 1999-06-25, 1999-06-28, 1999-06-29, 1999-06-30, 1999-07-01, 1999-07-02, 1999-07-06, 1999-07-07, 1999-07-08, 1999-07-09, 1999-07-12, 1999-07-13, 1999-07-14, 1999-07-15, 1999-07-16, 1999-07-19, 1999-07-20, 1999-07-21, 1999-07-22, 1999-07-23, 1999-07-26, 1999-07-27, 1999-07-28, 
1999-07-29, 1999-07-30, 1999-08-02, 1999-08-03, 1999-08-04, 1999-08-05, 1999-08-06, 1999-08-09, 1999-08-10, 1999-08-11, 1999-08-12, 1999-08-13, 1999-08-16, 1999-08-17, 1999-08-18, 1999-08-19, 1999-08-20, 1999-08-23, 1999-08-24, 1999-08-25, 1999-08-26, 1999-08-27, 1999-08-30, 1999-08-31, 1999-09-01, 1999-09-02, 1999-09-03, 1999-09-07, 1999-09-08, 1999-09-09, 1999-09-10, 1999-09-13, 1999-09-14, 1999-09-15, 1999-09-16, 1999-09-17, 1999-09-20, 1999-09-21, 1999-09-22, 1999-09-23, 1999-09-24, 1999-09-27, 1999-09-28, 1999-09-29, 1999-09-30, 1999-10-01, 1999-10-04, 1999-10-05, 1999-10-06, 1999-10-07, 1999-10-08, 1999-10-11, 1999-10-12, 1999-10-13, 1999-10-14, 1999-10-15, 1999-10-18, 1999-10-19, 1999-10-20, 1999-10-21, 1999-10-22, 1999-10-25, 1999-10-26, 1999-10-27, 1999-10-28, 1999-10-29, 1999-11-01, 1999-11-02, 1999-11-03, 1999-11-04, 1999-11-05, 1999-11-08, 1999-11-09, 1999-11-10, 1999-11-11, 1999-11-12, 1999-11-15, 1999-11-16, 1999-11-17, 1999-11-18, 1999-11-19, 1999-11-22, 1999-11-23, 1999-11-24, 1999-11-26, 1999-11-29, 1999-11-30, 1999-12-01, 1999-12-02, 1999-12-03, 1999-12-06, 1999-12-07, 1999-12-08, 1999-12-09, 1999-12-10, 1999-12-13, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-13, 2001-09-14, 2001-11-23, 2001-12-24, 2001-12-26, 2001-12-31, 2002-07-05, 2002-11-29, 2002-12-26, 2003-02-18, 2003-11-28, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24, 2011-01-03
CD 14 1995-08-09, 1995-12-26, 1996-01-02, 1996-06-11, 1996-06-20, 1996-09-09, 1996-09-11, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-01-02, 2001-09-12
CT 154 1995-11-24, 1996-01-08, 1996-07-05, 1996-11-29, 1996-12-24, 1997-11-28, 1997-12-26, 1998-11-27, 1999-11-26, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-11, 2001-09-12, 2001-09-13, 2001-09-14, 2001-11-12, 2001-11-23, 2001-12-24, 2001-12-31, 2002-05-21, 2002-05-22, 2002-05-23, 2002-05-24, 2002-05-28, 2002-05-29, 2002-05-30, 2002-05-31, 2002-06-03, 2002-06-04, 2002-06-05, 2002-06-06, 2002-06-07, 2002-06-10, 2002-06-11, 2002-06-12, 2002-06-13, 2002-06-14, 2002-06-17, 2002-06-18, 2002-06-19, 2002-06-20, 2002-06-21, 2002-06-24, 2002-06-25, 2002-06-26, 2002-06-27, 2002-06-28, 2002-07-01, 2002-07-02, 2002-07-03, 2002-07-05, 2002-07-08, 2002-07-09, 2002-07-10, 2002-07-11, 2002-07-12, 2002-07-15, 2002-07-16, 2002-07-17, 2002-07-18, 2002-07-19, 2002-07-22, 2002-07-23, 2002-07-24, 2002-07-25, 2002-07-26, 2002-07-29, 2002-07-30, 2002-07-31, 2002-08-01, 2002-08-02, 2002-08-05, 2002-08-06, 2002-08-07, 2002-08-08, 2002-08-09, 2002-08-12, 2002-08-13, 2002-08-14, 2002-08-15, 2002-08-16, 2002-08-19, 2002-08-20, 2002-08-21, 2002-08-22, 2002-08-23, 2002-08-26, 2002-08-27, 2002-08-28, 2002-08-29, 2002-08-30, 2002-09-03, 2002-09-04, 2002-09-05, 2002-09-06, 2002-09-09, 2002-09-10, 2002-09-11, 2002-09-12, 2002-09-13, 2002-09-16, 2002-09-17, 2002-09-18, 2002-09-19, 2002-09-20, 2002-09-23, 2002-09-24, 2002-09-25, 2002-09-26, 2002-09-27, 2002-09-30, 2002-10-01, 2002-10-02, 2002-10-03, 2002-10-04, 2002-10-07, 2002-10-08, 2002-10-09, 2002-10-10, 2002-10-11, 2002-10-14, 2002-10-15, 2002-10-16, 2002-10-17, 2002-10-18, 2002-10-21, 2002-10-22, 2002-10-23, 2002-10-24, 2002-10-25, 2002-10-28, 2002-10-29, 2002-10-30, 2002-10-31, 2002-11-01, 2002-11-04, 2002-11-05, 2002-11-06, 2002-11-07, 2002-11-29, 2002-12-24, 2003-02-18, 2003-11-28, 2003-12-26, 2004-01-02, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24
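So the function would be passed one parameter, n, and would return the n dates that occur most often in the data.frame above.
I looked at which.max, but could not figure out how to apply it across multiple rows (the entire data frame column), or how to get several separate dates (n of them) as output.
If the code is simpler with only a single output value, that is an acceptable starting point for me. Any pointers are appreciated.
Here is a pastebin, because I had trouble with the string length: http://pastebin.com/B1YPicC8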
> str(gaps)
'data.frame': 5560 obs. of 3 variables:
 $ Symbol      : Factor w/ 5560 levels "@AD#","@BP#",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Count       : int 27 14 3 285 14 154 540 11 3 11 ...
 $ MissingDates: Factor w/ 3568 levels "1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10 ...
Answer 0 (score: 4)
It looks like you need something like this:
The function
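freqfunc <- function(x, n) {
  # Split every row on ", ", pool all dates into one vector and count them;
  # sort() orders the counts ascending, so tail() keeps the n most frequent
  tail(sort(table(unlist(strsplit(as.character(x), ", ")))), n)
}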
Testing on your data set
freqfunc(gaps$MissingDates, 5) # Five most frequent dates
## 1996-12-26 1997-12-26 1998-01-02 1999-12-31 2001-09-12
## 4 4 4 4 4
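To make the one-liner above easier to follow, here is a minimal step-by-step sketch of the same idea. The data frame `gaps_demo` and its values are made up for illustration and are not the asker's data; the only other change is that the counts are sorted in decreasing order so the most frequent date comes first.

```r
# Toy data in the same shape as the question (values are made up for illustration)
gaps_demo <- data.frame(
  Symbol = c("AD", "BP", "C"),
  MissingDates = c("1995-12-26, 1996-01-02, 1999-12-31",
                   "1995-12-26, 1996-01-02",
                   "1995-12-26, 1999-12-31, 2002-12-24"),
  stringsAsFactors = TRUE  # mimic the factor column shown in str(gaps)
)

# 1. as.character() turns the factor into plain strings
# 2. strsplit() breaks each row into its individual dates
# 3. unlist() flattens the per-row results into one character vector
all_dates <- unlist(strsplit(as.character(gaps_demo$MissingDates), ", ", fixed = TRUE))

# 4. table() counts each distinct date; sort() orders the counts, largest first
counts <- sort(table(all_dates), decreasing = TRUE)

# 5. head() keeps the n most frequent dates (here n = 2)
top2 <- head(counts, 2)
print(top2)
```

When several dates share the same count, which of them makes the cut depends on how sort() orders ties, so it may be worth inspecting the full `counts` table before trimming it.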
Answer 1 (score: 1)
Here is another possible solution.
Set up some data:
set.seed(5)
ss1 <- sample(seq(s <- Sys.Date(), s+10, "day"), 20, TRUE)
ss2 <- sample(seq(s <- Sys.Date(), s+10, "day"), 20, TRUE)
ls1 <- list(ss1 = ss1, ss2 = ss2)
Define the function:
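# head() is used instead of [1:n] so that no NA entries appear when n exceeds the number of distinct dates
f <- function(x, n) head(sort(table(x), decreasing = TRUE), n)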
Apply the function to the data:
lapply(ls1, f, n = 3)
# $ss1
# x
# 2014-09-08 2014-09-09 2014-09-07
# 3 3 2
#
# $ss2
# x
# 2014-09-10 2014-09-06 2014-09-07
# 4 3 2
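The example above works on vectors of Date values, while the column in the question holds comma-separated strings. A rough sketch of how the two could be combined is shown below; `top_dates()` and the vector `demo` are hypothetical names and made-up values, not part of either answer, and the sketch assumes every entry is a valid "YYYY-MM-DD" date.

```r
# Hypothetical helper: splits the comma-separated strings first,
# then counts and ranks the dates as in the function above
top_dates <- function(x, n) {
  # ",\\s*" tolerates zero or more spaces after each comma
  dates <- unlist(strsplit(as.character(x), ",\\s*"))
  counts <- sort(table(dates), decreasing = TRUE)
  counts <- head(counts, n)  # returns fewer rows if there are fewer than n distinct dates
  data.frame(date = as.Date(names(counts)),
             count = as.integer(counts))
}

# Made-up input in the same format as the MissingDates column
demo <- c("2001-09-12, 2003-08-06",
          "2001-09-12, 2002-12-24, 2003-08-06",
          "2001-09-12")
res <- top_dates(demo, 2)
print(res)
```

Returning a data frame rather than a table is only a design choice here; it keeps the dates as Date objects, which may be convenient for later filtering or plotting.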