Sorting groups in an R data frame

Time: 2016-04-02 15:55:41

Tags: r sorting dataframe

I have a large data.frame of bond data, like this:

   ISIN      CF       DATE
    A   105.750  2016-09-30
    B   104.875  2016-05-31
    C   106.875  2017-02-13
    D   103.875  2016-10-07
    E   5.000    2016-04-21
    E   5.000    2017-04-21
    E   5.000    2018-04-21
    E   5.000    2019-04-21
    E   105.000  2020-04-21
    F   7.800    2016-09-09
    F   7.800    2017-09-09
    F   7.800    2018-09-09
    F   7.800    2019-09-09
    F   107.800  2020-09-09

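I would like to group the rows by ISIN code and, within each group, sort the CF entries by DATE in increasing order (this is already done in the example above). Then I want to sort the groups themselves (A, B, C, D, E, F) so that the group with the earliest date comes first, followed by the group with the second-earliest date, and so on.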

I would like it to look like this:

  ISIN     CF      DATE
    E   5.000   2016-04-21
    E   5.000   2017-04-21
    E   5.000   2018-04-21
    E   5.000   2019-04-21
    E   105.000 2020-04-21
    B   104.875 2016-05-31
    F    7.800  2016-09-09
    F    7.800  2017-09-09
    F    7.800  2018-09-09
    F    7.800  2019-09-09
    F   107.800 2020-09-09
    A   105.750 2016-09-30
    D   103.875 2016-10-07
    C   106.875 2017-02-13

I tried the approach from this question:

How to sort a dataframe by column(s)?

df <- df[order(df$ISIN, df$DATE), ]

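But it does not do what I want: it orders the groups alphabetically by ISIN rather than by their earliest date.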

2 Answers:

Answer 0 (score: 3)

This does the job. Basically, first create a rank for each ISIN based on its minimum date, then sort by that rank:

library(data.table)
setDT(DF)

DF[DF[ , min(DATE), by = ISIN
       ][ , .(ISIN, rank = frank(V1))
          ], on = "ISIN"
   ][order(rank, DATE)]
#     ISIN      CF       DATE rank
#  1:    E   5.000 2016-04-21    1
#  2:    E   5.000 2017-04-21    1
#  3:    E   5.000 2018-04-21    1
#  4:    E   5.000 2019-04-21    1
#  5:    E 105.000 2020-04-21    1
#  6:    B 104.875 2016-05-31    2
#  7:    F   7.800 2016-09-09    3
#  8:    F   7.800 2017-09-09    3
#  9:    F   7.800 2018-09-09    3
# 10:    F   7.800 2019-09-09    3
# 11:    F 107.800 2020-09-09    3
# 12:    A 105.750 2016-09-30    4
# 13:    D 103.875 2016-10-07    5
# 14:    C 106.875 2017-02-13    6
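To make the intermediate steps visible, here is a minimal sketch of the same rank-then-sort idea written in base R, so it runs without extra packages. It is not the data.table code above. The example data is rebuilt from the question, storing DATE as a Date is an assumption, and the names min_date, grp_rank, row_rank and sorted are only illustrative.

```r
# Example data rebuilt from the question; storing DATE as Date is an assumption.
DF <- data.frame(
  ISIN = c("A", "B", "C", "D", rep("E", 5), rep("F", 5)),
  CF   = c(105.750, 104.875, 106.875, 103.875,
           rep(5, 4), 105, rep(7.8, 4), 107.8),
  DATE = as.Date(c("2016-09-30", "2016-05-31", "2017-02-13", "2016-10-07",
                   "2016-04-21", "2017-04-21", "2018-04-21", "2019-04-21",
                   "2020-04-21", "2016-09-09", "2017-09-09", "2018-09-09",
                   "2019-09-09", "2020-09-09")),
  stringsAsFactors = FALSE
)

# Step 1: earliest date per ISIN, as a named numeric vector.
min_date <- vapply(split(as.numeric(DF$DATE), DF$ISIN), min, numeric(1))

# Step 2: rank the groups by that earliest date (1 = earliest).
grp_rank <- rank(min_date)

# Step 3: look up each row's group rank, then sort by rank and date.
row_rank <- unname(grp_rank[DF$ISIN])
sorted <- DF[order(row_rank, DF$DATE), ]

unique(sorted$ISIN)
```

Printing grp_rank shows one rank per ISIN, which is the same information the rank column carries in the data.table output.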

If you want to avoid creating a copy, do this instead:

DF[DF[ , min(DATE), by = ISIN
       ][ , .(ISIN, rank = frank(V1))
          ], rank := i.rank, on = "ISIN"]

setorder(DF, rank, DATE)

If you don't want to create the rank column, use factor levels instead:

# ISINs listed in order of their earliest date
ord <- DF[ , min(DATE), by = ISIN][ , ISIN[order(V1)]]

DF[ , ISIN := factor(ISIN, levels = ord)]
DF[order(ISIN, DATE)]
#     ISIN      CF       DATE
#  1:    E   5.000 2016-04-21
#  2:    E   5.000 2017-04-21
#  3:    E   5.000 2018-04-21
#  4:    E   5.000 2019-04-21
#  5:    E 105.000 2020-04-21
#  6:    B 104.875 2016-05-31
#  7:    F   7.800 2016-09-09
#  8:    F   7.800 2017-09-09
#  9:    F   7.800 2018-09-09
# 10:    F   7.800 2019-09-09
# 11:    F 107.800 2020-09-09
# 12:    A 105.750 2016-09-30
# 13:    D 103.875 2016-10-07
# 14:    C 106.875 2017-02-13

You can also do this in base R, though it will be somewhat slower:

ord <- names(sort(by(DF, DF$ISIN, function(x) min(x$DATE))))

DF$ISIN <- factor(DF$ISIN, levels = ord)

DF[with(DF, order(ISIN, DATE)),]
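If converting ISIN to a factor is not wanted, one possible base R variant is to attach each group's earliest date to every row with ave() and pass it to order() as the first key. This is only a sketch under assumptions: the data is rebuilt from the question with DATE stored as a Date, and first_date and sorted are illustrative names.

```r
# Example data rebuilt from the question; storing DATE as Date is an assumption.
DF <- data.frame(
  ISIN = c("A", "B", "C", "D", rep("E", 5), rep("F", 5)),
  CF   = c(105.750, 104.875, 106.875, 103.875,
           rep(5, 4), 105, rep(7.8, 4), 107.8),
  DATE = as.Date(c("2016-09-30", "2016-05-31", "2017-02-13", "2016-10-07",
                   "2016-04-21", "2017-04-21", "2018-04-21", "2019-04-21",
                   "2020-04-21", "2016-09-09", "2017-09-09", "2018-09-09",
                   "2019-09-09", "2020-09-09")),
  stringsAsFactors = FALSE
)

# Earliest date of each row's ISIN group, repeated for every row of the group.
first_date <- ave(as.numeric(DF$DATE), DF$ISIN, FUN = min)

# Sort by group's earliest date, then ISIN (keeps tied groups apart), then date.
sorted <- DF[order(first_date, DF$ISIN, DF$DATE), ]

unique(sorted$ISIN)  # "E" "B" "F" "A" "D" "C"
```

ISIN stays a plain character column here, which may be convenient if the data is merged with other tables later.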

Answer 1 (score: 2)

With dplyr, you can do the following:

library(dplyr)
df %>% group_by(ISIN) %>% 
  mutate(minDate = paste0(min(DATE), ISIN)) %>% 
  # sort by each group's key first, then by date within the group
  ungroup() %>% arrange(minDate, DATE) %>%
  select(-minDate)

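Note that the temporary minDate column also contains the ISIN, so that the case where two groups share the same minimum date is handled. Change mutate(minDate = paste0(min(DATE), ISIN)) to mutate(minDate = min(DATE)) to get rid of this.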
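The tie case can be illustrated with a small sketch. It uses base R rather than dplyr so that it runs without packages, and the two-group data set (ISINs "X" and "Y" sharing the same first date) is made up purely for illustration.

```r
# Hypothetical data: X and Y both start on 2016-01-01.
df <- data.frame(
  ISIN = c("X", "Y", "Y", "X"),
  DATE = as.Date(c("2016-01-01", "2016-01-01", "2017-01-01", "2018-01-01")),
  stringsAsFactors = FALSE
)

# Earliest date of each row's group.
min_date <- ave(as.numeric(df$DATE), df$ISIN, FUN = min)

# Sorting by the minimum date alone lets the tied groups interleave.
by_date_only <- df[order(min_date, df$DATE), ]
by_date_only$ISIN   # "X" "Y" "Y" "X"

# Adding ISIN as a tie-breaker keeps each group's rows together.
with_tiebreak <- df[order(min_date, df$ISIN, df$DATE), ]
with_tiebreak$ISIN  # "X" "X" "Y" "Y"
```

Pasting the ISIN onto the minimum date, as in the dplyr code, has the same effect as the extra ISIN sort key in this sketch.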