I have the following data frame (RES1):
"sequence" "support"
"1" "<{OV50}>" 0.286
"2" "<{OV148}>" 0.121
And another data frame (SRC2):
"sequenceID" "transactionID" "eventID" "items"
"1" 42207993 1577 1 "OV50"
"2" 42207993 6048 2 "OV11"
"3" 42207993 1597 3 "OV148"
"4" 57237976 12423 1 "OV56"
"5" 57237976 12589 2 "OV148"
I would like to get the following output data frame (OUT3):
"sequenceID" "transactionID" "eventID" "items" "Exist" "Co"
"1" 42207993 1577 1 "OV50" 1
"2" 42207993 6048 2 "OV11" 0
"3" 42207993 1597 3 "OV148" 1 0.67
"4" 57237976 12423 1 "OV56" 0
"5" 57237976 12589 2 "OV148" 1 0.5
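For each row of SRC2, the "Exist" column in OUT3 is 0 if the item does not appear in RES1 at all, and 1 otherwise. For example, OV11 never appears in RES1, so its value is 0. On the last row of each sequenceID, the number of 1s is divided by the number of rows with that sequenceID, and the result goes in the "Co" column. For example, at row 3 there are 3 rows with sequenceID = 42207993 and two of them have Exist = 1, so Co = 2/3 = 0.67. I am looking for the most efficient way to do this, because both data frames are very large.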
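The rule described above can be sketched in base R as follows. This is only an illustration of the intended logic on the sample rows, not a tuned solution for large data. The names known_items, share and is_last are made up for this sketch, "last row" is assumed to mean the last occurrence of a sequenceID in row order, and NA stands in for the blank "Co" cells shown in OUT3.

```r
# Sample inputs copied from the tables in the question.
RES1 <- data.frame(sequence = c("<{OV50}>", "<{OV148}>"),
                   support = c(0.286, 0.121),
                   stringsAsFactors = FALSE)
SRC2 <- data.frame(sequenceID = c(42207993L, 42207993L, 42207993L,
                                  57237976L, 57237976L),
                   transactionID = c(1577L, 6048L, 1597L, 12423L, 12589L),
                   eventID = c(1L, 2L, 3L, 1L, 2L),
                   items = c("OV50", "OV11", "OV148", "OV56", "OV148"),
                   stringsAsFactors = FALSE)

# Strip the "<{" and "}>" wrappers so sequences compare directly with items.
known_items <- gsub("[^[:alnum:]]+", "", RES1$sequence)

OUT3 <- SRC2
# Exist: 1 if the item occurs in RES1, otherwise 0.
OUT3$Exist <- as.integer(OUT3$items %in% known_items)

# Share of rows with Exist == 1 within each sequenceID.
share <- ave(as.numeric(OUT3$Exist), OUT3$sequenceID, FUN = mean)

# Assumption: the "last row" is the last occurrence of each sequenceID.
is_last <- !duplicated(OUT3$sequenceID, fromLast = TRUE)

# Co is filled only on the last row of each sequenceID (NA elsewhere).
OUT3$Co <- ifelse(is_last, round(share, 2), NA_real_)
OUT3
```

Using NA keeps "Co" numeric. If the blank cells are required literally, the column would have to be converted to character.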
Answer 0 (score: 1)
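One option is to use data.table. We convert the 'data.frame' to a 'data.table' (setDT(SRC2)), use gsub to remove the punctuation characters from the 'sequence' column of 'RES1', check whether each value of 'items' occurs among the cleaned values (%chin%), coerce the logical vector to binary by wrapping it with +, and assign (:=) the output to a new column 'Exist'. Grouped by 'sequenceID', we divide the sum of 'Exist' by the number of rows (.N), round the result, convert it to 'character', and assign it as 'Co'. Then we get the row index (.I) of the rows that are not the last row of each 'sequenceID' and set 'Co' to '' for those rows.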
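library(data.table) # v1.9.6+
# Exist: 1 if the item matches a cleaned 'sequence' value in RES1, else 0
setDT(SRC2)[, Exist := +(items %chin% gsub('[^[:alnum:]]+',
           '', RES1$sequence))]
# Co per sequenceID; i1 holds the indices of all rows except each group's last.
# seq_len(.N - 1) rather than 1:(.N - 1), so a one-row group yields no index.
i1 <- SRC2[, Co := as.character(round(sum(Exist)/.N, 2)),
           sequenceID][, .I[seq_len(.N - 1)], sequenceID]$V1
# Blank out Co everywhere except the last row of each sequenceID
SRC2[i1, Co := '']
SRC2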
# sequenceID transactionID eventID items Exist Co
#1: 42207993 1577 1 OV50 1
#2: 42207993 6048 2 OV11 0
#3: 42207993 1597 3 OV148 1 0.67
#4: 57237976 12423 1 OV56 0
#5: 57237976 12589 2 OV148 1 0.5
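Because '' forces 'Co' to be a character column, a variant that keeps it numeric may be easier to work with downstream. The following is a minimal sketch, assuming that NA is an acceptable stand-in for the blank cells. The object names dt and known_items are illustrative, and the extra row with sequenceID 99 is hypothetical, added only to show how a sequenceID with a single row behaves.

```r
library(data.table)

# Sample rows from SRC2 plus one hypothetical single-row sequenceID (99).
dt <- data.table(
  sequenceID = c(42207993L, 42207993L, 42207993L, 57237976L, 57237976L, 99L),
  items = c("OV50", "OV11", "OV148", "OV56", "OV148", "OV50")
)
# Cleaned 'sequence' values from RES1 (wrappers already removed).
known_items <- c("OV50", "OV148")

# Exist: 1 if the item occurs among the known items, else 0.
dt[, Exist := +(items %chin% known_items)]

# Co: NA on every row except the last one of each sequenceID, which gets
# the rounded share of rows with Exist == 1. rep(NA_real_, 0) is empty,
# so a one-row group simply receives its own share.
dt[, Co := c(rep(NA_real_, .N - 1L), round(sum(Exist) / .N, 2)),
   by = sequenceID]
dt
```

This does the assignment in a single grouped step, so no separate index vector is needed.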
# Data used in the example above (define these before running the code)
SRC2 <- structure(list(
  sequenceID = c(42207993L, 42207993L, 42207993L, 57237976L, 57237976L),
  transactionID = c(1577L, 6048L, 1597L, 12423L, 12589L),
  eventID = c(1L, 2L, 3L, 1L, 2L),
  items = c("OV50", "OV11", "OV148", "OV56", "OV148")),
  .Names = c("sequenceID", "transactionID", "eventID", "items"),
  class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
RES1 <- structure(list(
  sequence = c("<{OV50}>", "<{OV148}>"),
  support = c(0.286, 0.121)),
  .Names = c("sequence", "support"),
  class = "data.frame", row.names = c("1", "2"))