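In the data.table below, I have the composition of the teams taking part in several projects. The variable id gives the team ID and the variable event gives the project number. The variables membr and freqrel describe each team's composition: membr is the type of member and freqrel is its relative frequency within the team (you can see that freqrel sums to 1 within each team).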
library(data.table)
# dput() output; the unparseable .internal.selfref pointer has been dropped,
# and setDT() re-creates data.table's internal self-reference
dt <- setDT(structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), event = c("127b", "127b", "127b", "127b",
"127b", "127b", "127b", "127b", "127b", "125t", "125t", "125t",
"125t", "125t", "125t"), membr = c("engineer", "mathematician",
"physicist", "mathematician", "physicist", "surgeon", "dentist",
"mathematician", "programmer", "physicist", "sociologist", "surgeon",
"musician", "sociologist", "surgeon"), freqrel = c(0.4, 0.4,
0.2, 0.166666666666667, 0.5, 0.333333333333333, 0.333333333333333,
0.5, 0.166666666666667, 0.75, 0.125, 0.125, 0.444444444444444,
0.444444444444444, 0.111111111111111)), .Names = c("id", "event",
"membr", "freqrel"), row.names = c(NA, -15L), class = c("data.table",
"data.frame"), sorted = c("id", "event")))
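As a quick sanity check of the claim that freqrel sums to 1 within each team, the totals can be computed per id. This is only a minimal sketch: it rebuilds the table with data.table() so that it runs on its own, writes the fractions as 1/6, 1/3, 4/9 and 1/9 instead of the rounded decimals, and uses the illustrative name team_sums, which is not part of the original question.

```r
library(data.table)

# Same data as above, rebuilt compactly so the snippet is self-contained.
# Assumption: the rounded decimals in the dput() output stand for these fractions.
dt <- data.table(
  id = rep(1:5, each = 3),
  event = rep(c("127b", "125t"), times = c(9, 6)),
  membr = c("engineer", "mathematician", "physicist",
            "mathematician", "physicist", "surgeon",
            "dentist", "mathematician", "programmer",
            "physicist", "sociologist", "surgeon",
            "musician", "sociologist", "surgeon"),
  freqrel = c(0.4, 0.4, 0.2, 1/6, 0.5, 1/3, 1/3, 0.5, 1/6,
              0.75, 0.125, 0.125, 4/9, 4/9, 1/9)
)

# Sum of freqrel per team; each total should be 1 up to floating-point error
team_sums <- dt[, .(total = sum(freqrel)), by = id]
print(team_sums)
```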
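The way I see it, the data is split into nested groups. The first partition is at the project level (solid line) and the second is at the team level (dashed line).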
    id event         membr   freqrel
 1:  1  127b      engineer 0.4000000
 2:  1  127b mathematician 0.4000000
 3:  1  127b     physicist 0.2000000
--------------------------------------
 4:  2  127b mathematician 0.1666667
 5:  2  127b     physicist 0.5000000
 6:  2  127b       surgeon 0.3333333
--------------------------------------
 7:  3  127b       dentist 0.3333333
 8:  3  127b mathematician 0.5000000
 9:  3  127b    programmer 0.1666667
_____________________________________
10:  4  125t     physicist 0.7500000
11:  4  125t   sociologist 0.1250000
12:  4  125t       surgeon 0.1250000
--------------------------------------
13:  5  125t      musician 0.4444444
14:  5  125t   sociologist 0.4444444
15:  5  125t       surgeon 0.1111111
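Starting from this, I want to make the teams within the same project fully comparable by adding to each team the membr types it lacks but that occur in other teams of the same project, assigning them freqrel = 0. The result should look like this: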
    id event         membr   freqrel
 1:  1  127b       dentist 0.0000000
 2:  1  127b      engineer 0.4000000
 3:  1  127b mathematician 0.4000000
 4:  1  127b     physicist 0.2000000
 5:  1  127b    programmer 0.0000000
 6:  1  127b       surgeon 0.0000000
--------------------------------------
 7:  2  127b       dentist 0.0000000
 8:  2  127b      engineer 0.0000000
 9:  2  127b mathematician 0.1666667
10:  2  127b     physicist 0.5000000
11:  2  127b    programmer 0.0000000
12:  2  127b       surgeon 0.3333333
--------------------------------------
13:  3  127b       dentist 0.3333333
14:  3  127b      engineer 0.0000000
15:  3  127b mathematician 0.5000000
16:  3  127b     physicist 0.0000000
17:  3  127b    programmer 0.1666667
18:  3  127b       surgeon 0.0000000
_____________________________________
19:  4  125t      musician 0.0000000
20:  4  125t     physicist 0.7500000
21:  4  125t   sociologist 0.1250000
22:  4  125t       surgeon 0.1250000
--------------------------------------
23:  5  125t      musician 0.4444444
24:  5  125t     physicist 0.0000000
25:  5  125t   sociologist 0.4444444
26:  5  125t       surgeon 0.1111111
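In other words, after splitting the data by event using by, I need to split it again by team and compare the chunks produced by that second split.
The problem is that I don't know how to refer to the first chunk obtained with by, nor how to split it again and compare its parts with each other. Do you know how I could solve this?
I would be really grateful for any help.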
Answer 0 (score: 6)
Here is a simple way:
setkey(dt, id, membr)
ans <- dt[, .SD[CJ(unique(id), unique(membr))], by=list(event)]
You can then replace the NAs with 0s, like this:
ans[is.na(freqrel), freqrel := 0.0]
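Some explanation: your problem boils down to this. For each event, you need all possible combinations of id, membr, so that you can join against those combinations within each .SD group.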
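To make "all possible combinations within one event" concrete, here is a minimal sketch for event 125t alone. The names sub, combos and filled are illustrative and not part of the answer, and the join uses the on= argument (assuming a data.table version that supports it, 1.9.6 or later) rather than a key.

```r
library(data.table)

# Teams 4 and 5 of event "125t", copied from the question's data
# (4/9 and 1/9 are assumed to be the fractions behind the rounded decimals)
sub <- data.table(
  id = c(4L, 4L, 4L, 5L, 5L, 5L),
  membr = c("physicist", "sociologist", "surgeon",
            "musician", "sociologist", "surgeon"),
  freqrel = c(0.75, 0.125, 0.125, 4/9, 4/9, 1/9)
)

# Cross join: every team id paired with every membr type seen in this event
combos <- sub[, CJ(id = unique(id), membr = unique(membr))]
print(combos)  # 2 ids x 4 membr types = 8 rows

# Joining the original rows onto the combinations leaves freqrel = NA
# wherever a team does not have that membr type
filled <- sub[combos, on = c("id", "membr")]
print(filled)
```

Here team 4 has no musician and team 5 has no physicist, so those two rows are the ones that come back with NA.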
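First, we group by event. Within each group, we use CJ to get all combinations of id, membr (by default the result is keyed on all of its columns). To perform the join, however, .SD needs a key as well, which is why the key of dt is set to id, membr beforehand. The join is then performed within each group, which gives the expected result. Hope this helps.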
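For reference, the whole approach can be run end to end as follows. This is a sketch of a variant rather than the exact code above: it names the join columns with on= (assuming data.table 1.9.6 or later) so that no key has to be set first, and it rebuilds the question's data with data.table(), writing the fractions as 1/6, 1/3, 4/9 and 1/9.

```r
library(data.table)

# The question's data, rebuilt compactly so the snippet is self-contained
dt <- data.table(
  id = rep(1:5, each = 3),
  event = rep(c("127b", "125t"), times = c(9, 6)),
  membr = c("engineer", "mathematician", "physicist",
            "mathematician", "physicist", "surgeon",
            "dentist", "mathematician", "programmer",
            "physicist", "sociologist", "surgeon",
            "musician", "sociologist", "surgeon"),
  freqrel = c(0.4, 0.4, 0.2, 1/6, 0.5, 1/3, 1/3, 0.5, 1/6,
              0.75, 0.125, 0.125, 4/9, 4/9, 1/9)
)

# For each event, join that event's rows onto all id x membr combinations.
# `on =` names the join columns explicitly instead of relying on a key.
ans <- dt[, .SD[CJ(id = unique(id), membr = unique(membr)),
                on = c("id", "membr")],
          by = event]

# Combinations absent from the original data come back as NA; set them to 0
ans[is.na(freqrel), freqrel := 0]

print(ans)
```

With the question's data this yields 18 rows for event 127b (3 teams x 6 membr types) and 8 rows for event 125t (2 teams x 4 membr types), and freqrel still sums to 1 within each team.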