I am trying to simulate a new dataset from two smaller datasets. It is important to me that the margins computed from those smaller datasets are preserved in the final dataset. Hopefully this reproducible example will explain what I mean.
library(data.table) # 1.10.5
set.seed(123)
meanVal <- 40
Here I simulate some age and gender data. Each location will always have 2 gender levels and 100 age levels.
demoDat <- CJ(with(CJ(letters,letters[1:5]), paste0(V1,V2)), c("M","F"), 0:99)
setnames(demoDat, c("Location","Gender","Age"))
demoDat[, Val := rpois(.N, meanVal)]
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
---
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
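The same simulation can be sketched in stdlib Python for illustration (the `random` module has no Poisson sampler, so a small Knuth-style sampler stands in for `rpois`; the level names mirror the R example):

```python
import itertools
import math
import random

random.seed(123)

def rpois(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below exp(-lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# 26 * 5 = 130 two-letter location codes, as in CJ(letters, letters[1:5])
locations = [a + b for a in "abcdefghijklmnopqrstuvwxyz" for b in "abcde"]

# full cross of location x gender x age, each cell with a Poisson(40) count
rows = [(loc, g, age, rpois(40))
        for loc, g, age in itertools.product(locations, "MF", range(100))]

print(len(rows))  # 130 locations x 2 genders x 100 ages = 26000 rows
```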
This code simulates the temporal dimension of the data. In this case the dates are one week apart, but the real data does not have to follow that regularity; some weeks may be missing.
timeDat <- with(demoDat, CJ(unique(Location), seq(from=as.Date("2016-01-01"),by=7,length.out = 52)))
setnames(timeDat, c("Location","Date"))
totals <- demoDat[, .(Val=sum(Val)), by=.(Location)]
timeDat[totals, Val := rmultinom(1:.N, i.Val, prob=rep(1,.N)), by=.EACHI,on=.(Location)]
Location Date Val
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
---
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
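The `rmultinom` step above splits each location's total across the 52 weeks while keeping the per-location margin exact. A minimal Python sketch of that idea (toy totals, not the simulated data):

```python
import random
from collections import Counter

random.seed(123)

# hypothetical per-location totals (in the post these come from summing Val in demoDat)
totals = {"aa": 8000, "ab": 7900}
n_weeks = 52

week_vals = {}
for loc, total in totals.items():
    # a multinomial draw with equal probabilities: drop each of `total` units
    # into one of the 52 weeks uniformly at random
    counts = Counter(random.randrange(n_weeks) for _ in range(total))
    week_vals[loc] = [counts.get(w, 0) for w in range(n_weeks)]

# the per-location margin is preserved by construction
assert all(sum(week_vals[loc]) == totals[loc] for loc in totals)
```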
The Val column should sum to the same per-Location totals in both the demoDat and timeDat datasets.
timeDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
# Location V1
# 1: jb 8229
# 2: xb 8223
# 3: ad 8179
# 4: nc 8176
# 5: gd 8173
demoDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
# Location V1
# 1: jb 8229
# 2: xb 8223
# 3: ad 8179
# 4: nc 8176
# 5: gd 8173
Next, I want to create a dataset containing the Age, Gender, and Date variables, but I need to maintain the marginal totals of Val from both the demoDat and timeDat datasets.
I have a strategy that accomplishes this, but it uses quite a lot of RAM. Is there another strategy that performs the expansion within each group one at a time, perhaps with .EACHI?
Here is the expensive part of the operation: expanding the datasets so that the number of rows equals sum(Val). In cases where sum(Val) is > 500,000,000 this can be very costly, especially since the operation is repeated for the second dataset. I was hoping to use .EACHI so that only the data within each group is expanded, which would greatly reduce the memory footprint.
library(pryr)
memUsed <- mem_used()
demoDatBig <- demoDat[rep(1:.N, Val), .(Location, Gender, Age, ID=rowid(Location))]
timeDatBig <- timeDat[rep(1:.N, Val), .(Location, Date, ID=rowid(Location))]
demoDatBig[timeDatBig, Date := i.Date, on=.(Location, ID)]
finalBigDat <- demoDatBig[, .(Val=.N), by=.(Location, Gender, Age, Date)]
mem_used() - memUsed
# 47 MB
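The expand / join-on-rowid / collapse strategy above can be sketched in miniature in Python (toy numbers for one location, not the post's data):

```python
from collections import Counter

# toy margins for one location: (gender, age, Val) and (date, Val), both summing to 9
demo_tab = [("F", 0, 3), ("F", 1, 2), ("M", 0, 4)]
time_tab = [("2016-01-01", 5), ("2016-01-08", 4)]

# expand both tables to one row per unit (like rep(1:.N, Val)),
# implicitly numbering units within the location by position
demo_units = [(g, a) for g, a, v in demo_tab for _ in range(v)]
time_units = [d for d, v in time_tab for _ in range(v)]

# join on the unit number: unit i of demo_units gets date time_units[i]
pairs = list(zip(demo_units, time_units))

# collapse back to counts per (gender, age, date), like .(Val=.N)
final = Counter((ga, d) for ga, d in pairs)

# both margins survive the round trip
row, col = Counter(), Counter()
for (ga, d), n in final.items():
    row[ga] += n
    col[d] += n
assert row == Counter({("F", 0): 3, ("F", 1): 2, ("M", 0): 4})
assert col == Counter({"2016-01-01": 5, "2016-01-08": 4})
```

This also shows why the expansion is the costly step: the intermediate unit lists grow with sum(Val), not with the number of groups.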
So this operation takes 47 MB of RAM, but that grows dramatically as I increase meanVal. I would like it to use only about as much RAM as this operation would take on the largest Location and ID group. I think this is possible with .EACHI, but I am not sure how.
Location Gender Age Date Val
1: aa F 0 2016-01-01 36
2: aa F 1 2016-01-01 47
3: aa F 2 2016-01-01 29
4: aa F 3 2016-01-01 40
5: aa F 4 2016-01-01 24
---
32430: ze M 96 2016-12-16 7
32431: ze M 96 2016-12-23 34
32432: ze M 97 2016-12-23 45
32433: ze M 98 2016-12-23 38
32434: ze M 99 2016-12-23 39
#### Test 1
test1 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck
#### Test 2
test2 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck
I looked at both solutions and tracked memory usage and timings for each. Both solutions are amazing and a huge upgrade over what I currently have. @swihart's solution scales incredibly well to large meanVal, so I chose it as the accepted answer. Heather's answer will help in situations where meanVal is not so large. Both large and small meanVal come up often, so I need both.
meanVal Ans Time Mem Rows
1: 40 Mike.Gahan 0.6245470 secs 44.54293 32434
2: 40 Heather Turner 0.6391492 secs 38.65355 1352000
3: 40 swihart 11.1602619 secs 66.97550 1352000
4: 400 Mike.Gahan 2.593275 secs 437.23832 32611
5: 400 Heather Turner 1.303993 secs 38.79871 1352000
6: 400 swihart 11.736836 secs 66.97550 1352000
7: 4000 Mike.Gahan 30.390986 secs 4364.51501 32629
8: 4000 Heather Turner 6.279249 secs 38.79871 1352000
9: 4000 swihart 11.427965 secs 66.97550 1352000
10: 20000 Mike.Gahan -------did not finish----------
11: 20000 Heather Turner 23.78948 secs 36.30617 1352000
12: 20000 swihart 11.53811 secs 66.97550 1352000
13: 30000 Mike.Gahan -------did not finish----------
14: 30000 Heather Turner 537.6459 secs 57.15375 1352000
15: 30000 swihart 11.970013 secs 66.97474 1352000
Answer 0 (score: 1)
I ran your approach for different sizes of meanVal and saw the scaling problem with generating demoDatBig and timeDatBig. I have an approach (included at the bottom of this post) that generates cartDat, a Cartesian cross of the dates and the gender-age groups, which holds up well as meanVal and sum(Val) grow, as shown in this table listing the results of object.size() for the data.tables under discussion:
| meanVal | sum(Val) | demoDatBig (MB) | timeDatBig (MB) | cartDat (MB) |
|----------|----------|------------------|------------------|---------------|
| 40 | 1e6 | 27.8 | 15.9 | 67.1 |
| 400 | 1e7 | 277.6 | 158.7 | 67.1 |
| 4,000 | 1e8 | 2,776.8 | 1,586.8 | 67.1 |
| 40,000 | 1e9 | 27,770.3 | 15,868.7 | 67.1 |
The key to my approach is to generate a Cartesian cross between the unexpanded source data.tables demoDat and timeDat, and then use Iterative Multivariate Hypergeometric Sampling (IMHS) in a scheme that preserves the margins of both sources. To get IMHS functionality in R, I took the BiasedUrn package from CRAN and recompiled it so that it can handle 52 colors (in our application, dates). Let me know if the maximum number of dates for a given location needs adjusting and I will recompile. The resulting R package BiasedUrn52 is on GitHub.
My solution passes test1 and test2 and preserves the margins. However, it seems to spread the gender-age margins across more dates than the OP's procedure does. Allow me to elaborate:
If we take the first 5 rows of demoDat:
> head(demoDat,5)
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
and the first 6 rows of finalBigDat:
> head(finalBigDat,6)
Location Gender Age Date Val
1: aa F 0 2016-01-01 36
2: aa F 1 2016-01-01 47
3: aa F 2 2016-01-01 29
4: aa F 3 2016-01-01 40
5: aa F 4 2016-01-01 24
6: aa F 4 2016-01-08 26
we see that all 36 units of the F-0 gender-age group are attributed to 2016-01-01, while the 50 units of the F-4 group are spread across 2016-01-01 (24) and 2016-01-08 (26) but no other dates (50 = 24 + 26).
The IMHS approach spreads the margins across more dates (I am not sure whether that is desired; let me know). For example, for the 36 units in the F-0 group, instead of putting all 36 on 2016-01-01 as finalBigDat does, it scatters them across more dates (look at seq.Draws):
> cartDat[Location=='aa' & Gender=="F" & Age==0,
+ c("Location", "Gender", "Age", "Date", "seq.Draws"),
+ with=FALSE]
Location Gender Age Date seq.Draws
1: aa F 0 2016-01-01 1
2: aa F 0 2016-01-08 0
3: aa F 0 2016-01-15 1
4: aa F 0 2016-01-22 1
5: aa F 0 2016-01-29 0
6: aa F 0 2016-02-05 0
7: aa F 0 2016-02-12 0
8: aa F 0 2016-02-19 0
9: aa F 0 2016-02-26 0
10: aa F 0 2016-03-04 0
11: aa F 0 2016-03-11 0
12: aa F 0 2016-03-18 0
13: aa F 0 2016-03-25 3
14: aa F 0 2016-04-01 1
15: aa F 0 2016-04-08 0
16: aa F 0 2016-04-15 0
17: aa F 0 2016-04-22 1
18: aa F 0 2016-04-29 1
19: aa F 0 2016-05-06 0
20: aa F 0 2016-05-13 2
21: aa F 0 2016-05-20 0
22: aa F 0 2016-05-27 0
23: aa F 0 2016-06-03 0
24: aa F 0 2016-06-10 0
25: aa F 0 2016-06-17 1
26: aa F 0 2016-06-24 2
27: aa F 0 2016-07-01 0
28: aa F 0 2016-07-08 0
29: aa F 0 2016-07-15 0
30: aa F 0 2016-07-22 1
31: aa F 0 2016-07-29 0
32: aa F 0 2016-08-05 1
33: aa F 0 2016-08-12 1
34: aa F 0 2016-08-19 1
35: aa F 0 2016-08-26 1
36: aa F 0 2016-09-02 1
37: aa F 0 2016-09-09 2
38: aa F 0 2016-09-16 0
39: aa F 0 2016-09-23 1
40: aa F 0 2016-09-30 0
41: aa F 0 2016-10-07 2
42: aa F 0 2016-10-14 3
43: aa F 0 2016-10-21 0
44: aa F 0 2016-10-28 1
45: aa F 0 2016-11-04 1
46: aa F 0 2016-11-11 1
47: aa F 0 2016-11-18 0
48: aa F 0 2016-11-25 0
49: aa F 0 2016-12-02 2
50: aa F 0 2016-12-09 1
51: aa F 0 2016-12-16 1
52: aa F 0 2016-12-23 1
This note about how differently the OP approach and the IMHS cartDat approach distribute the counts is just an aside. The margins are preserved, as shown below.
The margins of timeDat are preserved:
> cartDat[, sum(seq.Draws), by=.(Location, Date)]
Location Date V1
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
4: aa 2016-01-22 154
5: aa 2016-01-29 174
---
6756: ze 2016-11-25 169
6757: ze 2016-12-02 148
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
> timeDat
Location Date Val
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
4: aa 2016-01-22 154
5: aa 2016-01-29 174
---
6756: ze 2016-11-25 169
6757: ze 2016-12-02 148
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
And the margins of demoDat:
> cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
Location Gender Age V1
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
---
25996: ze M 95 48
25997: ze M 96 41
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
> demoDat
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
---
25996: ze M 95 48
25997: ze M 96 41
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
Here is the IMHS cartDat approach, with some tests:
#Cartesian cross of demoDat and timeDat
devtools::install_github("swihart/BiasedUrn52")
library(BiasedUrn52)
setkey(timeDat, Location)
setkey(demoDat, Location, Gender, Age)
cartDat <- demoDat[timeDat, allow.cartesian=TRUE]
setkeyv(cartDat, key(demoDat))
cartDat
cartDat[,group:=.GRP,by=c("Gender", "Age") ]
cartDat[,demoDat.Val:=Val]
cartDat[,timeDat.Val:=i.Val]
setcolorder(cartDat, c("Location",
"group",
"Gender",
"Age",
"Val",
"demoDat.Val",
"Date",
"timeDat.Val",
"i.Val"))
#Define Iterative Multivariate Hypergeometric Sampling function
imhs <- function(.N, Val, i.Val, group){
grp.ind <- unique(group)
num.grp <- max(group)
grp.size <- as.numeric(table(group))
draws <- rep(NA, length(group))
for(grp in grp.ind){
if(grp==1){
draws[group==1] = rMFNCHypergeo(1,
i.Val[group==1],
Val[group==1][1],
rep(1/grp.size[grp.ind==1],grp.size[grp.ind==1])
)
i.Val[group==2]= i.Val[group==1]-draws[group==1]
}else{
draws[group==grp] = rMFNCHypergeo(1,
i.Val[group==grp],
Val[group==grp][1],
rep(1/grp.size[grp.ind==grp],grp.size[grp.ind==grp])
)
      if(grp < num.grp){  # the last group has no successor, so stop updating there
i.Val[group==(grp+1)]= i.Val[group==grp]-draws[group==grp]
}
}
}
list(i.Val, draws)
}
# run it the data.table way:
cartDat[,
c("seq.Val", "seq.Draws") := imhs(.N, demoDat.Val, timeDat.Val, group),
by=c("Location") ]
# take a look:
cartDat
# reconciliation
demoDat[, sum(Val), by=.(Location)][order(-V1)]
cartDat[, sum(seq.Draws), by=.(Location)][order(-V1)]
# do the checks for the margins:
cartDat[, sum(seq.Draws), by=.(Location, Date)]
timeDat
cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
demoDat
# such different sizes due to distributing across more dates:
nrow(demoDat)
nrow(cartDat)
nrow(cartDat[seq.Draws != 0])
nrow(finalBigDat)
nrow(cartDat[seq.Draws != 0])/nrow(finalBigDat)
# attain and print object sizes for cartDat
print(object.size(cartDat), units = "Mb")
print(object.size(cartDat[seq.Draws!=0]), units="Mb")
# attain and print object sizes for demoDatBig, timeDatBig, finalBigData
print(object.size(demoDatBig), units = "Mb")
print(object.size(timeDatBig), units = "Mb")
print(object.size(finalBigDat), units = "Mb")
## (OP) The solution would pass these tests:
finalBigDat2 <- cartDat
#### Test 1 (change to sum(seq.Draws))
test1 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck
#### Test 2 (change to sum(seq.Draws))
test2 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck
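For reference, the sampling scheme imhs implements can be sketched in pure Python with an explicit urn (with equal weights, where the biased urn reduces to a plain multivariate hypergeometric): dates are the urn's colors, and each gender-age group draws its total without replacement from the remaining date capacities. Toy numbers, not the post's data:

```python
import random
from collections import Counter

random.seed(123)

# one location: date margins (urn contents) and gender-age group totals.
# both margins sum to 20.
date_margin = {"2016-01-01": 8, "2016-01-08": 7, "2016-01-15": 5}
group_totals = {("F", 0): 6, ("F", 1): 9, ("M", 0): 5}

remaining = dict(date_margin)
draws = {}
for group, total in group_totals.items():
    # multivariate hypergeometric draw: pull `total` balls without replacement
    # from an urn whose colors are the dates that still have capacity
    urn = [d for d, n in remaining.items() for _ in range(n)]
    taken = Counter(random.sample(urn, total))
    draws[group] = taken
    for d, n in taken.items():
        remaining[d] -= n  # condition the next group on what is left

# both margins are preserved exactly
assert all(sum(draws[g].values()) == t for g, t in group_totals.items())
assert all(sum(draws[g].get(d, 0) for g in draws) == date_margin[d]
           for d in date_margin)
```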
Answer 1 (score: 1)
By joining on location and ID, your current approach systematically matches dates to observations within the Gender:Age categories, e.g. assigning the first date to the first 176 observations (covering the first four Gender:Age categories and part of the fifth), and so on.
Instead, you can use the r2dtable function from the stats package to simulate the counts for each date within each Gender:Age category, i.e. to simulate a two-way (Gender:Age) x Date contingency table with fixed totals.
First create the combinations of Gender, Age and Date within each location:
setkey(timeDat, Location)
setkey(demoDat, Location)
finalBigDat <- demoDat[timeDat, .(Location, Gender, Age, Date),
                       allow.cartesian=TRUE]
,将行总计设置为每个性别中的计数:年龄类别,列总数作为每个日期的计数:
r2dtable
根据需要通过测试1和2。无论setkey(finalBigDat, Location)
finalBigDat[, Val := c(r2dtable(1, demoDat[.BY, Val], timeDat[.BY, Val])[[1]]),
by = Location]
head(finalBigDat)
# Location Gender Age Date Val
# 1: aa F 0 2016-01-01 0
# 2: aa F 1 2016-01-01 3
# 3: aa F 2 2016-01-01 2
# 4: aa F 3 2016-01-01 2
# 5: aa F 4 2016-01-01 1
# 6: aa F 5 2016-01-01 0
的值是什么,finalBigDat
的维度始终相同,但meanVal
需要更长的时间来运行更高的rd2table
。以下是一些说明性的时间:
meanVal
I tried with meanVal = 40000, but stopped the code after it had been running for around 300 seconds, so you may struggle if you want to set a very high meanVal.
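r2dtable samples from the set of tables with the given margins (via Patefield's algorithm). A rough stdlib-Python stand-in, drawing each row's counts without replacement from the remaining column capacities, illustrates the fixed-margin idea (this is a sketch of the distribution, not Patefield's efficient implementation):

```python
import random
from collections import Counter

random.seed(2)

def random_table(row_sums, col_sums):
    """Random contingency table with the given margins: each row's counts are a
    multivariate hypergeometric draw from the remaining column capacities."""
    assert sum(row_sums) == sum(col_sums)
    remaining = list(col_sums)
    table = []
    for r in row_sums:
        # urn of column indices, one ball per unit of remaining capacity
        urn = [j for j, n in enumerate(remaining) for _ in range(n)]
        taken = Counter(random.sample(urn, r))
        row = [taken.get(j, 0) for j in range(len(col_sums))]
        table.append(row)
        remaining = [n - row[j] for j, n in enumerate(remaining)]
    return table

# toy margins: 3 gender-age rows, 3 date columns, both summing to 112
tab = random_table([36, 47, 29], [40, 30, 42])
assert [sum(row) for row in tab] == [36, 47, 29]
assert [sum(tab[i][j] for i in range(3)) for j in range(3)] == [40, 30, 42]
```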