Merging two text files in R by date without a memory allocation error

Time: 2015-02-08 05:10:34

Tags: r

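The following describes the problem I am currently facing. When I run the merge, memory use suddenly balloons and R hangs. I even looked into the xdf format, but since the files are small, size should not be an issue.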

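I have two 1-minute data sets: S&P500 and GE stock. Because the data sets are incomplete, the dates they cover differ. I need to combine the data sets on the dates they have in common, and then by time.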

GE stock data: approximately 40 MB through 2015. Number of rows = 980465. The data frame is named GE_Last

         Date   Time  Open  High   Low Close Volume
1: 2007-04-27 145900 36.73 36.74 36.70 36.70  40900
2: 2007-04-27 150000 36.71 36.72 36.70 36.71  50100
3: 2007-04-27 150100 36.71 36.73 36.69 36.70 167550
4: 2007-04-27 150200 36.70 36.71 36.68 36.69  81900
5: 2007-04-27 150300 36.69 36.73 36.68 36.71 153500
6: 2007-04-27 150400 36.71 36.72 36.70 36.70  86600

S&P500 data: approximately 34 MB through 2015. Number of rows = 2220101. The data frame is named ES_Last

         Date   Time    Open    High     Low   Close Volume
1: 2007-12-09 230100 1517.00 1517.00 1516.75 1516.75      2
2: 2007-12-09 230700 1516.00 1516.00 1515.75 1515.75      2
3: 2007-12-09 230900 1515.50 1515.50 1515.25 1515.25      2
4: 2007-12-09 232700 1516.00 1516.00 1516.00 1516.00      1
5: 2007-12-09 233100 1515.75 1515.75 1515.75 1515.75      1


Combined = merge(GE_Last,ES_Last,by="Date",all.x=TRUE)

Running the merge throws an error and R hangs:

Error: cannot allocate vector of size 4.1 Gb
In addition: Warning messages:
1: In NextMethod("[") :
  Reached total allocation of 16296Mb: see help(memory.size)
2: In NextMethod("[") :
  Reached total allocation of 16296Mb: see help(memory.size)
3: In NextMethod("[") :
  Reached total allocation of 16296Mb: see help(memory.size)
4: In NextMethod("[") :
  Reached total allocation of 16296Mb: see help(memory.size)

2 Answers:

Answer 0 (score: 1)

As already pointed out, you should use a unique key if you want the data to merge correctly.
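To see why the key matters, here is a minimal sketch on made-up toy data (the values and the names `a` and `b` are illustrative, not taken from the question). When the key is only `Date`, every row of one table is paired with every row of the other table that shares that date, so the result grows with the product of the per-day row counts.

```r
# Toy data: 2 dates, 3 one-minute bars per date in each table (hypothetical values)
a <- data.frame(Date  = rep(c("2007-04-27", "2007-04-30"), each = 3),
                Time  = rep(c(150000L, 150100L, 150200L), times = 2),
                Close = 1:6)
b <- data.frame(Date  = rep(c("2007-04-27", "2007-04-30"), each = 3),
                Time  = rep(c(150000L, 150100L, 150200L), times = 2),
                Close = 11:16)

# Non-unique key: each date yields 3 x 3 = 9 rows
by_date <- merge(a, b, by = "Date", all.x = TRUE)

# Unique key: one row per timestamp
by_both <- merge(a, b, by = c("Date", "Time"), all.x = TRUE)

nrow(by_date)  # 18
nrow(by_both)  # 6
```

With hundreds of one-minute bars per day on each side, the same multiplication is the likely cause of the multi-gigabyte allocation in the question.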

library(data.table)
library(stringr) # string manipulation - just to help recreate data
library(dplyr) # data manipulation
library(lubridate) # times and dates manipulation
library(tidyr) # for tidying data - just to help recreate data
library(sqldf) # using SQL might help with memory issues

# first, let's recreate your data

N = 100000

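# format() always keeps the time part; in recent R versions as.character() can drop it at midnight
df1 <- data.table(Date_time = format(seq(ISOdate(2000,1,1), by = "min", length.out = N), "%Y-%m-%d %H:%M:%S"),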
             Open        = rnorm(N, mean = 36),
             High        = rnorm(N, mean = 36),
             Low         = rnorm(N, mean = 36),
             Close       = rnorm(N, mean = 36),
             Volume      = rpois(N, lambda = 40000)) %>% 
  separate(Date_time, c("Date", "Time"), sep = " ") %>% 
  mutate(Time = str_replace_all(Time, ":", ""))

N = 200000

df2 <- data.table(Date_time = format(seq(ISOdate(2000,1,1), by = "min", length.out = N), "%Y-%m-%d %H:%M:%S"),
              Open        = rnorm(N, mean = 36),
              High        = rnorm(N, mean = 36),
              Low         = rnorm(N, mean = 36),
              Close       = rnorm(N, mean = 36),
              Volume      = rpois(N, lambda = 40000)) %>% 
  separate(Date_time, c("Date", "Time"), sep = " ") %>% 
  mutate(Time = str_replace_all(Time, ":", ""))

So, now we need to create unique keys. As you said, you have 1-minute data, so we create a key with one-minute resolution.

df1 <- df1 %>% mutate(Date_time = ymd_hms(paste0(Date, Time))) # ymd_hms from lubridate is good at converting various date char strings into R dates
df2 <- df2 %>% mutate(Date_time = ymd_hms(paste0(Date, Time)))
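If you prefer to build the key in base R, the sketch below shows one way. It assumes that `Time` is stored as an integer in HHMMSS form, as the printed data in the question suggests; in that case times before 10:00 have only five digits and need zero-padding first. The sample rows and the name `bars` are hypothetical.

```r
# Hypothetical sample in the question's layout: Date as text, Time as integer HHMMSS
bars <- data.frame(Date = c("2007-04-27", "2007-04-30"),
                   Time = c(145900L, 93000L),
                   stringsAsFactors = FALSE)

# Pad to six digits so that 93000 becomes "093000"
time_chr <- sprintf("%06d", bars$Time)

# Parse date and time into one POSIXct key (UTC is an arbitrary choice here)
bars$Date_time <- as.POSIXct(paste(bars$Date, time_chr),
                             format = "%Y-%m-%d %H%M%S", tz = "UTC")
```

Checking `anyNA(bars$Date_time)` afterwards is a cheap way to catch rows that failed to parse.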

Now that we have the keys, let's merge.

merged1 <- merge(df1, df2, by = "Date_time", all.x = TRUE)

# or, if you have memory issues, sqldf can help. At least it helped me a few times at work.

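# aliases keep the df2 columns from clashing with the df1 column names
merged2 <- sqldf(
  "SELECT df1.*
    ,df2.Open   AS Open_2
    ,df2.High   AS High_2
    ,df2.Low    AS Low_2
    ,df2.Close  AS Close_2
    ,df2.Volume AS Volume_2
   FROM df1
   LEFT JOIN df2
   ON df1.Date_time = df2.Date_time") %>%
  as.data.table() # tbl_dt() is no longer part of dplyr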
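Another option, not part of the original answer, is a data.table join with `on =`, which avoids some of the intermediate copies that `merge()` on plain data frames makes. The following is a minimal sketch on hypothetical toy tables (`dt1`, `dt2`) standing in for `df1` and `df2`.

```r
library(data.table)

# Hypothetical toy tables with a one-minute key
t0  <- as.POSIXct("2000-01-01 12:00:00", tz = "UTC")
dt1 <- data.table(Date_time = t0 + 60 * 0:4, Close = 1:5)
dt2 <- data.table(Date_time = t0 + 60 * 2:6, Close = 11:15)

# Left join: keep every row of dt1 and attach the matching dt2 values.
# In the result, Close comes from dt2 and i.Close comes from dt1.
merged3 <- dt2[dt1, on = "Date_time"]
```

Rows of `dt1` with no match in `dt2` get `NA` in the `dt2` columns, which mirrors `all.x = TRUE` in `merge()`.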

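Answer 1 (score: 0)

As MrFlick stated, Date does not appear to be a unique key. If you want to keep all the data, a quick solution is to stack the two data sets with rbind and then sort them by date: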

allData<-rbind(GE_Last,ES_Last)
o<-order(allData$Date)
sortedAllData<-allData[o,]
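One caveat with stacking is that the rows no longer say which instrument they came from. A small sketch of one way around that, assuming you are free to add a column (the `Symbol` column and the toy rows below are illustrative, not from the question):

```r
# Hypothetical toy rows in the question's layout
ge <- data.frame(Date  = c("2007-12-10", "2007-12-09"),
                 Time  = c(93000L, 230100L),
                 Close = c(36.70, 36.60))
es <- data.frame(Date  = c("2007-12-09", "2007-12-10"),
                 Time  = c(230100L, 93000L),
                 Close = c(1516.75, 1517.00))

# Label each source before stacking so the rows stay distinguishable
ge$Symbol <- "GE"
es$Symbol <- "ES"
all_data <- rbind(ge, es)

# Sort by date, then by time within each date
sorted_all_data <- all_data[order(all_data$Date, all_data$Time), ]
```

This keeps the data in long format: the result has as many rows as both inputs together, so memory use grows by addition rather than multiplication.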

I hope this helps. Kind regards

Edit:

Drvi's answer seems to be correct. I would only add that you can do this directly:

merge(GE_Last,ES_Last,by=c("Date","Time"),all.x=TRUE)
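Before running this on the full data, it may be worth confirming that the key really is unique. The sketch below uses a hypothetical helper `is_unique_key()` and toy rows. It also shows that leaving out `all.x` gives an inner join, which keeps only the timestamps present in both tables, matching the "common dates" goal in the question.

```r
# Hypothetical helper: TRUE if no two rows share the same values in `cols`
is_unique_key <- function(df, cols) {
  anyDuplicated(as.data.frame(df)[, cols, drop = FALSE]) == 0L
}

# Toy rows in the question's layout (values are made up)
ge <- data.frame(Date  = c("2007-12-10", "2007-12-10", "2007-12-11"),
                 Time  = c(93000L, 93100L, 93000L),
                 Close = c(36.70, 36.71, 36.69))
es <- data.frame(Date  = c("2007-12-10", "2007-12-10", "2007-12-12"),
                 Time  = c(93000L, 93100L, 93000L),
                 Close = c(1516.75, 1517.00, 1515.25))

is_unique_key(ge, "Date")             # FALSE: several bars per day
is_unique_key(ge, c("Date", "Time"))  # TRUE

# Inner join on both columns keeps only timestamps found in both tables
common <- merge(ge, es, by = c("Date", "Time"), suffixes = c("_GE", "_ES"))
```

If the check returns `FALSE` for `c("Date", "Time")` on the real data, duplicated timestamps should be resolved first, since they would again multiply rows in the merge.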