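The following describes the problem I am currently facing. When I merge, memory use suddenly balloons and R hangs. I even looked into the xdf format, but since the files are small, size should not be a problem.
I have two stock datasets, S&P 500 and GE, both as 1-minute data. Because the datasets are incomplete, the dates they cover differ. I need to combine the datasets on the dates they have in common, and then by time.
GE stock data: about 40 MB up to 2015. Number of rows = 980465. The data frame is named GE_Last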
Date Time Open High Low Close Volume
1: 2007-04-27 145900 36.73 36.74 36.70 36.70 40900
2: 2007-04-27 150000 36.71 36.72 36.70 36.71 50100
3: 2007-04-27 150100 36.71 36.73 36.69 36.70 167550
4: 2007-04-27 150200 36.70 36.71 36.68 36.69 81900
5: 2007-04-27 150300 36.69 36.73 36.68 36.71 153500
6: 2007-04-27 150400 36.71 36.72 36.70 36.70 86600
S&P 500 stock data: about 34 MB up to 2015. Number of rows = 2220101. The data frame is named ES_Last
Date Time Open High Low Close Volume
1: 2007-12-09 230100 1517.00 1517.00 1516.75 1516.75 2
2: 2007-12-09 230700 1516.00 1516.00 1515.75 1515.75 2
3: 2007-12-09 230900 1515.50 1515.50 1515.25 1515.25 2
4: 2007-12-09 232700 1516.00 1516.00 1516.00 1516.00 1
5: 2007-12-09 233100 1515.75 1515.75 1515.75 1515.75 1
Combined = merge(GE_Last,ES_Last,by="Date",all.x=TRUE)
Running the merge throws an error and hangs:
Error: cannot allocate vector of size 4.1 Gb
In addition: Warning messages:
1: In NextMethod("[") :
Reached total allocation of 16296Mb: see help(memory.size)
2: In NextMethod("[") :
Reached total allocation of 16296Mb: see help(memory.size)
3: In NextMethod("[") :
Reached total allocation of 16296Mb: see help(memory.size)
4: In NextMethod("[") :
Reached total allocation of 16296Mb: see help(memory.size)
Answer 0 (score: 1)
As has already been pointed out, you should use a unique key if you want the data to merge correctly.
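To make the point about unique keys concrete, here is a minimal sketch on toy data. The data frames `a` and `b` and all their values are made up for illustration and are not taken from the question. With minute bars, every date appears many times in both tables, so joining on `Date` alone pairs each row of one table with every same-day row of the other.

```r
# Toy data (hypothetical): 2 dates, 3 one-minute bars per date in each table
a <- data.frame(
  Date    = rep(c("2007-12-10", "2007-12-11"), each = 3),
  Time    = rep(c("093000", "093100", "093200"), times = 2),
  Close_a = 1:6
)
b <- data.frame(
  Date    = rep(c("2007-12-10", "2007-12-11"), each = 3),
  Time    = rep(c("093000", "093100", "093200"), times = 2),
  Close_b = 11:16
)

# Joining on Date alone: every row of `a` matches all 3 same-day rows of `b`
by_date <- merge(a, b, by = "Date", all.x = TRUE)

# Joining on Date and Time: each row of `a` matches at most one row of `b`
by_key <- merge(a, b, by = c("Date", "Time"), all.x = TRUE)

nrow(by_date)  # 18 = 2 dates * 3 rows * 3 rows
nrow(by_key)   # 6
```

On real minute data the same multiplication happens for every shared date, which would be consistent with the allocation error shown in the question.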
library(data.table)
library(stringr) # string manipulation - just to help recreate data
library(dplyr) # data manipulation
library(lubridate) # times and dates manipulation
library(tidyr) # for tidying data - just to help recreate data
library(sqldf) # using SQL might help with memory issues
# first, let's recreate your data
N = 100000
df1 <- data.table(Date_time = as.character(seq(c(ISOdate(2000,1,1)), by = "min", length.out = N)),
Open = rnorm(N, mean = 36),
High = rnorm(N, mean = 36),
Low = rnorm(N, mean = 36),
Close = rnorm(N, mean = 36),
Volume = rpois(N, lambda = 40000)) %>%
separate(Date_time, c("Date", "Time"), sep = " ") %>%
mutate(Time = str_replace_all(Time, ":", ""))
N = 200000
df2 <- data.table(Date_time = as.character(seq(c(ISOdate(2000,1,1)), by = "min", length.out = N)),
Open = rnorm(N, mean = 36),
High = rnorm(N, mean = 36),
Low = rnorm(N, mean = 36),
Close = rnorm(N, mean = 36),
Volume = rpois(N, lambda = 40000)) %>%
separate(Date_time, c("Date", "Time"), sep = " ") %>%
mutate(Time = str_replace_all(Time, ":", ""))
So now we need to create a unique key. As you said, you have 1-minute data, so we create a key at one-minute resolution.
df1 <- df1 %>% mutate(Date_time = ymd_hms(paste0(Date, Time))) # ymd_hms from lubridate is good at converting various date char strings into R dates
df2 <- df2 %>% mutate(Date_time = ymd_hms(paste0(Date, Time)))
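A key of this kind can also be built with base R only, and it is worth checking that it really is unique before merging. The sketch below rests on assumptions: the toy data frame `ge`, the helper column `Time6`, the UTC time zone, and the idea that `Time` might be stored as an integer (and so lose its leading zero before 10:00) are illustrative and not stated in the question.

```r
# Toy data (hypothetical); 93000 stands for 09:30:00 stored as an integer
ge <- data.frame(
  Date  = c("2007-04-27", "2007-04-27", "2007-04-30"),
  Time  = c(145900L, 150000L, 93000L),
  Close = c(36.70, 36.71, 36.80)
)

# Pad Time back to six digits so that "%H%M%S" can parse it
ge$Time6 <- sprintf("%06d", ge$Time)

# Build a single timestamp key from Date and the padded Time
ge$Date_time <- as.POSIXct(paste(ge$Date, ge$Time6),
                           format = "%Y-%m-%d %H%M%S", tz = "UTC")

# Fail early if any timestamp did not parse or if the key is not unique
stopifnot(!anyNA(ge$Date_time))
stopifnot(anyDuplicated(ge$Date_time) == 0)
```

If the second check fails on the real data, the duplicated timestamps need to be resolved first, otherwise the merge will still multiply rows for those timestamps.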
Now that we have the key, let's merge.
merged1 <- merge(df1, df2, by = "Date_time", all.x = TRUE)
# or, if you have memory issues, sqldf can help. At least it helped me a few times at work.
merged2 <- sqldf(
"SELECT df1.*
,df2.Open
,df2.High
,df2.Low
,df2.Close
,df2.Volume
FROM df1
LEFT JOIN df2
ON df1.Date_time = df2.Date_time") %>%
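as.data.table() # tbl_dt() is no longer part of dplyr; as.data.table() from data.table returns a data.table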
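Another option, not used in the answer above, is data.table's own join syntax with `on =`. The following is a minimal sketch on made-up data; the tables `ge` and `es` and the column names `Close_GE` and `Close_ES` are hypothetical.

```r
library(data.table)

# Toy data (hypothetical): the two tables share only some timestamps
ge <- data.table(Date     = c("2007-12-10", "2007-12-10", "2007-12-11"),
                 Time     = c("093000", "093100", "093000"),
                 Close_GE = c(36.7, 36.8, 36.9))
es <- data.table(Date     = c("2007-12-10", "2007-12-11", "2007-12-11"),
                 Time     = c("093000", "093000", "093100"),
                 Close_ES = c(1517.00, 1516.00, 1515.50))

# Left join: keep every row of ge, fill Close_ES with NA where es has no match
left <- es[ge, on = c("Date", "Time")]

# Inner join: keep only timestamps present in both tables
inner <- es[ge, on = c("Date", "Time"), nomatch = NULL]

nrow(left)   # 3
nrow(inner)  # 2
```

In `es[ge, on = ...]` the rows of `ge` drive the result, so this plays the same role as `all.x = TRUE` with `ge` as the left table. The inner join may be closer to what the question asks for, since it keeps only the timestamps the two series have in common.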
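Answer 1 (score: 0)
As MrFlick mentioned, Date does not appear to be a unique key. If you want to keep all the data, a quick solution is to rbind the two datasets and then sort them by date: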
allData<-rbind(GE_Last,ES_Last)
o<-order(allData$Date)
sortedAllData<-allData[o,]
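One caveat with stacking is that the combined table no longer says which instrument a row came from, and sorting by date alone leaves the minutes within a day in arbitrary relative order. A small sketch of one way to handle both, on made-up data (the `Symbol` column and the toy frames `ge` and `es` are assumptions added for illustration):

```r
# Toy data (hypothetical)
ge <- data.frame(Date  = c("2007-12-10", "2007-12-11"),
                 Time  = c("093000", "093000"),
                 Close = c(36.7, 36.9))
es <- data.frame(Date  = c("2007-12-10", "2007-12-10"),
                 Time  = c("093000", "092900"),
                 Close = c(1517.00, 1516.75))

# Tag each row with its source before stacking
ge$Symbol <- "GE"
es$Symbol <- "ES"
all_data <- rbind(ge, es)

# Sort by date, then time within the date, then symbol to break ties
sorted <- all_data[order(all_data$Date, all_data$Time, all_data$Symbol), ]
sorted
```

This keeps memory use at the sum of the two tables, since no row is ever paired with another.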
I hope this helps. Kind regards
Edit:
Drvi's answer appears to be correct. I would just add that you can do this directly:
merge(GE_Last,ES_Last,by=c("Date","Time"),all.x=TRUE)
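Because both tables have columns with the same names (Open, High, Low, Close, Volume), the merged result gets `.x` and `.y` suffixes by default. As a sketch on made-up data, the `suffixes` argument of `merge` can label them by instrument, and leaving out `all.x` keeps only timestamps present in both tables. The toy frames `ge` and `es` and the suffix names are hypothetical.

```r
# Toy data (hypothetical)
ge <- data.frame(Date  = c("2007-12-10", "2007-12-10", "2007-12-11"),
                 Time  = c("093000", "093100", "093000"),
                 Close = c(36.7, 36.8, 36.9))
es <- data.frame(Date  = c("2007-12-10", "2007-12-11", "2007-12-11"),
                 Time  = c("093000", "093000", "093100"),
                 Close = c(1517.00, 1516.00, 1515.50))

# Left join with readable column names instead of Close.x / Close.y
left <- merge(ge, es, by = c("Date", "Time"), all.x = TRUE,
              suffixes = c("_GE", "_ES"))

# Inner join (the default, all = FALSE): common timestamps only
both <- merge(ge, es, by = c("Date", "Time"), suffixes = c("_GE", "_ES"))

names(left)  # "Date" "Time" "Close_GE" "Close_ES"
nrow(left)   # 3
nrow(both)   # 2
```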