Sorting a data frame based on month-year time format

Time: 2011-01-04 10:08:58

Tags: r sorting time

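I am struggling with something very basic: sorting a data frame based on a time format (month-year, or "%B-%y" in this case). My goal is to calculate various monthly statistics, starting with the sum.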

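The relevant part of the data frame looks like this (this part works fine and matches my goal; I include it here to show where the problem may originate):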

> tmp09
   Instrument AccountValue   monthYear   ExitTime
1         JPM         6997    april-07 2007-04-10
2         JPM         7261      mei-07 2007-05-29
3         JPM         7545     juli-07 2007-07-18
4         JPM         7614     juli-07 2007-07-19
5         JPM         7897 augustus-07 2007-08-22
10        JPM         7423 november-07 2007-11-02
11        KFT         6992      mei-07 2007-05-14
12        KFT         6944      mei-07 2007-05-21
13        KFT         7069     juli-07 2007-07-09
14        KFT         6919     juli-07 2007-07-16
# Order on the exit time, which corresponds with 'monthYear'
> tmp09.sorted <- tmp09[order(tmp09$ExitTime),]
> tmp09.sorted
   Instrument AccountValue   monthYear   ExitTime
1         JPM         6997    april-07 2007-04-10
11        KFT         6992      mei-07 2007-05-14
12        KFT         6944      mei-07 2007-05-21
2         JPM         7261      mei-07 2007-05-29
13        KFT         7069     juli-07 2007-07-09
14        KFT         6919     juli-07 2007-07-16
3         JPM         7545     juli-07 2007-07-18
4         JPM         7614     juli-07 2007-07-19
5         JPM         7897 augustus-07 2007-08-22
10        JPM         7423 november-07 2007-11-02

So far so good: sorting on ExitTime works. The trouble starts when I calculate the totals per month and then try to sort that output:

# Calculate the total results per month
> Tmp09Totals <- tapply(tmp09.sorted$AccountValue, tmp09.sorted$monthYear, sum)
> Tmp09Totals <- data.frame(Tmp09Totals)
> Tmp09Totals
            Tmp09Totals
april-07           6997
augustus-07        7897
juli-07           29147
mei-07            21197
november-07        7423

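How can I sort this output chronologically?

I have already tried (besides various attempts to convert monthYear to another date format): order, sort, sort.list, sort_df, reshape, and computing the sums with tapply, lapply, sapply, and aggregate. Even rewriting the row names (by giving them a number from 1 to length(tmp09.sorted2$AccountValue)) did not work. I also tried giving each month-year a different ID, based on what I learned in another question, but R also had trouble distinguishing the different month-year values.

The correct order for this output is april-07, mei-07, juli-07, augustus-07, november-07: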

apr-07  6997
mei-07  21197
jul-07  29147
aug-07  7897
nov-07  7423

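6 answers:

Answer 0 (score: 9)

It is easier to use separate Month and Year factors whose levels are in the correct order, and then to use tapply on the combination of the two variables, for example: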

## The Month factor
tmp09 <- within(tmp09,
                Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                    levels = month.name)))
## for @Jura25's locale, we can't use the built-in English constant;
## instead, we can use this solution, from ?month.name:
## format(ISOdate(2000, 1:12, 1), "%B")
tmp09 <- within(tmp09,
                Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
                                                    levels = format(ISOdate(2000, 1:12, 1), "%B"))))
##
## And the Year factor
tmp09 <- within(tmp09, Year <- factor(strftime(ExitTime, format = "%Y")))

This gives us (in my locale):

> head(tmp09)
   Instrument AccountValue   monthYear   ExitTime    Month Year
1         JPM         6997    april-07 2007-04-10    April 2007
2         JPM         7261      mei-07 2007-05-29      May 2007
3         JPM         7545     juli-07 2007-07-18     July 2007
4         JPM         7614     juli-07 2007-07-19     July 2007
5         JPM         7897 augustus-07 2007-08-22   August 2007
10        JPM         7423 november-07 2007-11-02 November 2007

Then use tapply with both factors:

> with(tmp09, tapply(AccountValue, list(Month, Year), sum))
          2007
April     6997
May      21197
July     29147
August    7897
November  7423

Or via aggregate:

> with(tmp09, aggregate(AccountValue, list(Month = Month, Year = Year), sum))
     Month Year     x
1    April 2007  6997
2      May 2007 21197
3     July 2007 29147
4   August 2007  7897
5 November 2007  7423
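
If the month names do not need to appear in the result, a locale-independent variant of the same idea is to group on a "%Y-%m" key, which sorts chronologically even as plain text. This is only a minimal sketch: it assumes ExitTime is stored as a Date, retypes the question's data inline, and uses a made-up column name ym.

```r
# Sample data retyped from the question (ExitTime assumed to be a Date)
tmp09 <- data.frame(
  Instrument   = c("JPM", "JPM", "JPM", "JPM", "JPM", "JPM",
                   "KFT", "KFT", "KFT", "KFT"),
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423,
                   6992, 6944, 7069, 6919),
  ExitTime     = as.Date(c("2007-04-10", "2007-05-29", "2007-07-18",
                           "2007-07-19", "2007-08-22", "2007-11-02",
                           "2007-05-14", "2007-05-21", "2007-07-09",
                           "2007-07-16"))
)

# "ym" is a hypothetical helper column: a year-month key such as "2007-04".
# It contains digits only, so it does not depend on the LC_TIME locale.
tmp09$ym <- format(tmp09$ExitTime, "%Y-%m")

# aggregate() returns the groups sorted by the key, i.e. chronologically
totals <- aggregate(AccountValue ~ ym, data = tmp09, FUN = sum)
print(totals)
```

Because the key starts with the four-digit year, the order should also stay correct when the data span several years.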

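Answer 1 (score: 4)

Try the "yearmon" class in the zoo package, since it sorts appropriately. Below we create the sample data frame DF and then add a YearMonth column of class "yearmon". Finally, we perform the aggregation. The actual processing is just the last two lines (the rest only creates the sample data frame).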

Lines <-   "Instrument AccountValue   monthYear   ExitTime
JPM         6997    april-07 2007-04-10
JPM         7261      mei-07 2007-05-29
JPM         7545     juli-07 2007-07-18
JPM         7614     juli-07 2007-07-19
JPM         7897 augustus-07 2007-08-22
JPM         7423 november-07 2007-11-02
KFT         6992      mei-07 2007-05-14
KFT         6944      mei-07 2007-05-21
KFT         7069     juli-07 2007-07-09
KFT         6919     juli-07 2007-07-16"
library(zoo)
DF <- read.table(textConnection(Lines), header = TRUE)

DF$YearMonth <- as.yearmon(DF$ExitTime)
aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)

This gives the following:

> aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
  YearMonth Instrument AccountValue
1  Apr 2007        JPM         6997
2  May 2007        JPM         7261
3  Jul 2007        JPM        15159
4  Aug 2007        JPM         7897
5  Nov 2007        JPM         7423
6  May 2007        KFT        13936
7  Jul 2007        KFT        13988

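A slightly different approach, with slightly different output, uses read.zoo directly. It produces one column per instrument and one row per year/month. We assign appropriate classes to the columns via colClasses, using "NULL" for the monthYear column because we will not use it. We also specify that the time index is the 3rd of the remaining columns and that the input should be split into columns by column 1. FUN = as.yearmon indicates that the time index should be converted from class "Date" to class "yearmon", and aggregate = sum sums all values that share the same index.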

z <- read.zoo(textConnection(Lines),  header = TRUE, index = 3, 
     split = 1, colClasses = c("character", "numeric", "NULL", "Date"),
     FUN = as.yearmon, aggregate = sum)

The resulting zoo object looks like this:

> z
           JPM   KFT
Apr 2007  6997    NA
May 2007  7261 13936
Jul 2007 15159 13988
Aug 2007  7897    NA
Nov 2007  7423    NA

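We may prefer to keep it as a zoo object to take advantage of the other features in zoo, or we can convert it to a data frame with data.frame(Time = time(z), coredata(z)), which makes the time a separate column. as.data.frame(z) also works and uses the times as row names.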

Answer 2 (score: 3)

You can reorder the factor levels with the reorder function.

tmp09$monthYear <- reorder(tmp09$monthYear, as.numeric(as.Date(tmp09$ExitTime)))

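The trick is to use the numeric representation of the date, i.e. the number of days since 1970-01-01 (see ?Date), and to let reorder use its mean within each level (mean is reorder's default summary function) as the sort key.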
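
To see the effect on the monthly totals, here is a small self-contained sketch. The data are retyped from the question, and only base R is assumed.

```r
# Sample data retyped from the question
tmp09 <- data.frame(
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423,
                   6992, 6944, 7069, 6919),
  monthYear    = c("april-07", "mei-07", "juli-07", "juli-07",
                   "augustus-07", "november-07", "mei-07", "mei-07",
                   "juli-07", "juli-07"),
  ExitTime     = as.Date(c("2007-04-10", "2007-05-29", "2007-07-18",
                           "2007-07-19", "2007-08-22", "2007-11-02",
                           "2007-05-14", "2007-05-21", "2007-07-09",
                           "2007-07-16"))
)

# Order the levels of monthYear by the mean exit date within each level
tmp09$monthYear <- reorder(factor(tmp09$monthYear),
                           as.numeric(tmp09$ExitTime))

# tapply() follows the level order, so the totals come out chronologically
totals <- tapply(tmp09$AccountValue, tmp09$monthYear, sum)
print(totals)
```

Note that this keeps the original Dutch labels and needs no locale setting, since the ordering comes from ExitTime rather than from parsing the month names.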

Answer 3 (score: 1)

Edit: I originally misunderstood the question. First copy the data given in the question, then:

> tmp09 <- read.table(file="clipboard", header=TRUE)
> Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
[1] "Dutch_Belgium.1252"

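# create POSIXlt variable from monthYear; note that "%d" parses the "-07"
# year suffix as day 7, so this relies on every row being from 2007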
> tmp09$d <- strptime(paste("2007", tmp09$monthYear, sep="-"), "%Y-%B-%d")

# create ordered factor
> tmp09$dFac <- droplevels(cut(tmp09$d, breaks="month", ordered=TRUE))
> tmp09[order(tmp09$d), ]
   Instrument AccountValue   monthYear   ExitTime          d       dFac
1         JPM         6997    april-07 2007-04-10 2007-04-07 2007-04-01
2         JPM         7261      mei-07 2007-05-29 2007-05-07 2007-05-01
11        KFT         6992      mei-07 2007-05-14 2007-05-07 2007-05-01
12        KFT         6944      mei-07 2007-05-21 2007-05-07 2007-05-01
3         JPM         7545     juli-07 2007-07-18 2007-07-07 2007-07-01
4         JPM         7614     juli-07 2007-07-19 2007-07-07 2007-07-01
13        KFT         7069     juli-07 2007-07-09 2007-07-07 2007-07-01
14        KFT         6919     juli-07 2007-07-16 2007-07-07 2007-07-01
5         JPM         7897 augustus-07 2007-08-22 2007-08-07 2007-08-01
10        JPM         7423 november-07 2007-11-02 2007-11-07 2007-11-01

> Tmp09Totals <- tapply(tmp09$AccountValue, tmp09$dFac, sum)
> Tmp09Totals
2007-04-01 2007-05-01 2007-07-01 2007-08-01 2007-11-01 
      6997      21197      29147       7897       7423
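
A variation that avoids the locale setting altogether is to cut ExitTime itself rather than a date rebuilt from monthYear. This is only a sketch under the assumption that ExitTime is (or can be converted to) a Date column; the name monthStart is hypothetical.

```r
# Sample data retyped from the question
tmp09 <- data.frame(
  AccountValue = c(6997, 7261, 7545, 7614, 7897, 7423,
                   6992, 6944, 7069, 6919),
  ExitTime     = as.Date(c("2007-04-10", "2007-05-29", "2007-07-18",
                           "2007-07-19", "2007-08-22", "2007-11-02",
                           "2007-05-14", "2007-05-21", "2007-07-09",
                           "2007-07-16"))
)

# cut() on a Date with breaks = "month" labels each row with the first day
# of its month; droplevels() removes months that have no observations
tmp09$monthStart <- droplevels(cut(tmp09$ExitTime, breaks = "month"))

# The factor levels are already in chronological order
totals <- tapply(tmp09$AccountValue, tmp09$monthStart, sum)
print(totals)
```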

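Answer 4 (score: 1)

It looks like the main problem is how to sort a series of Month-Year strings chronologically. The simplest way is to prepend "01-" to each Month-Year string and sort them as regular dates (parsing with %B requires the LC_TIME locale to match the language of the month names, Dutch here). So take your final data frame Tmp09Totals and do this: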

monYear <- rownames(Tmp09Totals)
sortedMonYear <- format(sort( as.Date( paste('01-', monYear, sep = ''),
                                       '%d-%B-%y')), 
                       '%B-%y')
Tmp09Totals[ sortedMonYear, , drop = FALSE]
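
If changing the locale is not an option, one possible workaround is to look the month names up in an explicit vector instead of relying on %B. The sketch below assumes Dutch month names spelled as in the question and two-digit years that all fall in the same century; dutch_months, month_num and year_num are names invented for this example.

```r
# Totals in the alphabetical order produced in the question
Tmp09Totals <- data.frame(
  Tmp09Totals = c(6997, 7897, 29147, 21197, 7423),
  row.names   = c("april-07", "augustus-07", "juli-07", "mei-07",
                  "november-07")
)

# Assumption: month names are Dutch and lower case, as in the question
dutch_months <- c("januari", "februari", "maart", "april", "mei", "juni",
                  "juli", "augustus", "september", "oktober", "november",
                  "december")

# Split "april-07" into its month name and two-digit year
parts     <- strsplit(rownames(Tmp09Totals), "-", fixed = TRUE)
month_num <- match(sapply(parts, `[`, 1), dutch_months)
year_num  <- as.integer(sapply(parts, `[`, 2))

# Sort by year first, then by month number within the year
sorted <- Tmp09Totals[order(year_num, month_num), , drop = FALSE]
print(sorted)
```

A month name that is missing from the lookup vector yields NA in month_num and ends up last, so it is worth checking anyNA(month_num) before trusting the order.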

Answer 5 (score: 0)

An old post, but a data.table approach is worth adding.

Read in the data and set the locale as described by @caracal:

> Sys.setlocale(category="LC_TIME", locale="Dutch_Belgium.1252")
[1] "Dutch_Belgium.1252"
> tmp09 <- read.table(file="clipboard", header=TRUE)
> tmp09$ExitTime <- as.Date(tmp09$ExitTime)

Aggregate the data as required:

> require(data.table)
> data.table(tmp09)[, 
+                   .(Tmp09Total = sum(AccountValue)),
+                   by = .(Date = format(ExitTime, "%B-%y"))]
          Date Tmp09Total
1:    april-07       6997
2:      mei-07      21197
3:     juli-07      29147
4: augustus-07       7897
5: november-07       7423