Selecting data from a data.frame by a time condition

Time: 2017-11-25 16:17:52

Tags: r dataframe

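Good afternoon! I am trying to automate the processing of financial data, and I have run into a problem selecting the rows I need from a data.frame.

For example, here is the head of my data.frame: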

               period bid_open bid_high bid_low bid_close ask_open ask_high 
1 2015-01-02 00:00:00  1.20860  1.20880 1.20860   1.20870  1.20890  1.20890 
2 2015-01-02 00:01:00  1.20870  1.20880 1.20865   1.20865  1.20880  1.20890 
3 2015-01-02 00:02:00  1.20865  1.20880 1.20865   1.20875  1.20875  1.20885 
4 2015-01-02 00:03:00  1.20875  1.20885 1.20875   1.20885  1.20885  1.20900 
5 2015-01-02 00:04:00  1.20885  1.20885 1.20880   1.20880  1.20895  1.20895 
6 2015-01-02 00:05:00  1.20880  1.20885 1.20880   1.20880  1.20890  1.20895 

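The main point of interest is the first column, period. The data can have a time frequency of 1m (as shown above), 1s, 1h, or 1d. I want to write a function that takes a frequency parameter. For example, if frequency=2h, the function should return a new data.frame containing only the observations (stock prices) at 2h intervals: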

2015-01-02 00:00:00
2015-01-02 02:00:00
2015-01-02 04:00:00
....

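If the requested frequency is, for example, 15s, R should return the initial data.frame, because the initial data has a frequency of 1m and cannot be made finer.

I have a few problems implementing this task. Could you help me?

My logic is:

First, find the initial frequency: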

    time=data[,1]
    freq=as.numeric(difftime(time[2], time[1]))

The problem is that R returns only a number (freq=1 in this case), and I cannot tell whether it means 1m, 1h, or 1d. How can I fix this?
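One way to keep the unit from being lost is to ask `difftime()` for a fixed unit. The sketch below assumes the period column can be parsed with `as.POSIXct()`; the `tz = "UTC"` setting is an assumption, not something stated in the question.

```r
# Two consecutive timestamps from the sample data, parsed as POSIXct.
# tz = "UTC" is an assumption; use the time zone of your data feed.
time <- as.POSIXct(c("2015-01-02 00:00:00", "2015-01-02 00:01:00"), tz = "UTC")

# By default difftime() chooses the unit automatically, so as.numeric()
# on the result returns a bare number and the unit is dropped.
auto <- difftime(time[2], time[1])
units(auto)  # reports which unit was chosen

# Forcing units = "secs" makes the number unambiguous.
freq_secs <- as.numeric(difftime(time[2], time[1], units = "secs"))
freq_secs
```

With the frequency always expressed in seconds, 1m, 1h, and 1d data can be compared on the same scale.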

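Second, suppose I request freq=5m but my data has a frequency of 1m. I then need to thin out the table and keep only the 1st, 6th, 11th, ... rows. How can I do that? Thank you!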
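Keeping the 1st, 6th, 11th, ... rows can be sketched with `seq()`. The toy table below is made up for illustration, and the approach assumes the data has no missing minutes; with gaps, row positions no longer line up with the clock.

```r
# Toy table with 1-minute spacing (12 rows); values are made up for illustration.
data <- data.frame(
  period = as.POSIXct("2015-01-02 00:00:00", tz = "UTC") + 60 * (0:11),
  bid_open = 1.2086 + 0.00005 * (0:11)
)

# Target frequency (5 min) divided by the initial frequency (1 min).
step <- 5

# seq() produces the row indices 1, 6, 11, ...
keep <- seq(1, nrow(data), by = step)
data[keep, ]
```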

1 Answer:

Answer 0: (score: 0)

Here is one possible solution:

    # 1. Load library
    library(dplyr)

    # 2. Data set sample (bid_open values taken from the question)
    df <- data.frame(
      period = c("2015-01-02 00:00:00", "2015-01-02 00:01:00", "2015-01-02 00:02:00",
                 "2015-01-02 00:03:00", "2015-01-02 00:04:00", "2015-01-02 00:05:00"),
      bid_open = c(1.20860, 1.20870, 1.20865, 1.20875, 1.20885, 1.20880),
      stringsAsFactors = FALSE)

    # 3. Feature engineering: split the timestamp string into numeric parts
    df <- df %>% mutate(
      year = as.numeric(substr(period, 1, 4)),
      month = as.numeric(substr(period, 6, 7)),
      day = as.numeric(substr(period, 9, 10)),
      hour = as.numeric(substr(period, 12, 13)),
      min = as.numeric(substr(period, 15, 16)),
      sec = as.numeric(substr(period, 18, 19)))

    # 4. Select data function
    # str_frequency is a two-digit value followed by a column name, e.g. "02min"
    select_data <- function(df, str_frequency) {

      # 4.1 Split the frequency string into its value and its unit
      frequency_value <- as.numeric(substr(str_frequency, 1, 2))
      frequency_type <- substr(str_frequency, 3, nchar(str_frequency))

      # 4.2 Keep rows whose unit column is a multiple of the value (modulus operator %%)
      df_result <- df[df[[frequency_type]] %% frequency_value == 0, ]

      # 4.3 Return result
      return(df_result)
    }

    # 5. Test ("02min" is the basic test that actually filters the sample)
    select_data(df, "01year")
    select_data(df, "01month")
    select_data(df, "01day")
    select_data(df, "01hour")
    select_data(df, "02min") # filters the sample; "03min" also works
    select_data(df, "01sec")
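Note that the modulus test above looks at one time component only, so "02min" would also keep a row stamped 00:02:30 if the data were per-second. As an alternative, here is a minimal sketch that works on the whole timestamp and accepts the "15s" / "5m" / "2h" / "1d" notation from the question. The names `parse_frequency` and `select_by_frequency` are hypothetical helpers invented for this illustration, and `tz = "UTC"` is an assumption.

```r
# Hypothetical helper: turn "15s", "5m", "2h", "1d" into a number of seconds.
parse_frequency <- function(frequency) {
  m <- regmatches(frequency, regexec("^([0-9]+)([smhd])$", frequency))[[1]]
  if (length(m) != 3) stop("frequency must look like '15s', '5m', '2h' or '1d'")
  unit_secs <- c(s = 1, m = 60, h = 3600, d = 86400)
  as.numeric(m[2]) * unit_secs[[m[3]]]
}

# Hypothetical helper: keep only rows whose timestamp lies on the requested grid.
# Assumes the first column holds timestamps such as "2015-01-02 00:01:00"
# (or is already POSIXct) and that the data has at least two rows.
select_by_frequency <- function(data, frequency, tz = "UTC") {
  time <- data[[1]]
  if (!inherits(time, "POSIXct")) time <- as.POSIXct(as.character(time), tz = tz)
  target <- parse_frequency(frequency)
  initial <- as.numeric(difftime(time[2], time[1], units = "secs"))
  # A request finer than the data cannot be met, so return the input unchanged.
  if (target <= initial) return(data)
  # as.numeric() gives seconds since the epoch; multiples of `target` lie on the grid.
  data[as.numeric(time) %% target == 0, , drop = FALSE]
}

# Usage with the sample from the question.
df <- data.frame(
  period = c("2015-01-02 00:00:00", "2015-01-02 00:01:00", "2015-01-02 00:02:00",
             "2015-01-02 00:03:00", "2015-01-02 00:04:00", "2015-01-02 00:05:00"),
  bid_open = c(1.20860, 1.20870, 1.20865, 1.20875, 1.20885, 1.20880),
  stringsAsFactors = FALSE
)
select_by_frequency(df, "5m")   # keeps the 00:00:00 and 00:05:00 rows
select_by_frequency(df, "15s")  # finer than the data: returns df unchanged
```

The grid here is anchored to clock time (00:00, 00:05, ...) rather than to the first row of the table, which matches the 2h example in the question only when the first observation itself falls on the grid.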