Question

I have the following data frame (RES1):

   "sequence" "support"
"1" "<{OV50}>"   0.286
"2" "<{OV148}>"  0.121
"3" "<{OV46},{OV197}>" 0.065
"4" "<{OV198},{OV199}, {OV205}>" 0.065

And another data frame (SRC2):

  "sequenceID" "transactionID" "eventID" "items"
"1" 42207993       1577          1        "OV50"
"2" 42207993       6048          2        "OV11"
"3" 42207993       1597          3        "OV148"
"4" 57237976       12423         1        "OV46"
"5" 57237976       12589         2        "OV197"

I would like to get the following output data frame (OUT3):

  "sequenceID" "transactionID" "eventID" "items"  "Exist" "Co"
"1" 42207993       1577          1        "OV50"     1
"2" 42207993       6048          2        "OV11"     0
"3" 42207993       1597          3        "OV148"    1       0.67
"4" 57237976       12423         1        "OV46"     0
"5" 57237976       12589         2        "OV197"    1       0.5

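For each row in SRC2, the "Exist" column in OUT3 should be 0 if the item does not appear in RES1 at all. For example, OV11 does not appear anywhere in RES1, so its value is 0. On the last row of each sequenceID, the number of 1s is divided by the number of rows with that sequenceID, and the result goes in the "Co" column. In row 3, there are 3 rows with sequenceID = 42207993 and two of them are 1, so Co is 2/3 = 0.67. I am looking for the most efficient way to do this, because both data frames are very large.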

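In addition, a row of RES1 may contain 2 or more items. I want to match them in the correct order: if OV46 appears before OV197 within the same sequenceID (57237976), I want a 1 in the Exist column on the OV197 row. A RES1 row can contain 2, 3 or more OVs. The order within each sequenceID matters: OV197 gets a 1 only when OV46 comes before it.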

Answer 1

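We can use stri_extract_last_regex from library(stringi) to extract the last alphanumeric string from the 'sequence' column of 'RES1'. We compare this with the 'items' column of 'SRC2' and coerce the logical result to binary by wrapping it with +. Before doing this, we convert the 'data.frame' to a 'data.table' (setDT(SRC2)). Grouped by 'sequenceID', we take the sum of 'Exist', divide it by the number of rows (.N), round, and convert to character class to create 'Co'. Using the row index (.I), we then change the elements of 'Co' that are not the last element of each 'sequenceID' to ''.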

library(stringi)
library(data.table)
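# last item of each pattern, e.g. "OV197" from "<{OV46},{OV197}>"
v1 <- stri_extract_last_regex(RES1$sequence, '[[:alnum:]]+')
# 1 if the item is the last item of some pattern, else 0
setDT(SRC2)[, Exist := +(items %chin% v1)]
# share of 1s per sequenceID; i1 holds every row except the last of each group
i1 <- SRC2[, Co := as.character(round(sum(Exist)/.N, 2)), by = sequenceID
         ][, head(.I, -1L), by = sequenceID]$V1
# blank 'Co' everywhere except the last row of each sequenceID
SRC2[i1, Co := '']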
SRC2
#   sequenceID transactionID eventID items Exist   Co
#1:   42207993          1577       1  OV50     1     
#2:   42207993          6048       2  OV11     0     
#3:   42207993          1597       3 OV148     1 0.67
#4:   57237976         12423       1  OV46     0     
#5:   57237976         12589       2 OV197     1  0.5
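
The code above compares only the last item of each pattern, so it does not check that the earlier items (for example OV46 before OV197) really occur first within the same sequenceID. A minimal base R sketch of an order-aware check follows. It is not the answerer's method. It assumes "in the correct order" means the earlier items of a pattern appear earlier in the same sequenceID, with gaps allowed; the helper names is_subsequence and exist_in_order are made up for illustration; and 'Co' is stored as a number with NA on non-final rows rather than as ''. It is written for clarity and would need tuning for very large inputs.

```r
# Toy data copied from the question
RES1 <- data.frame(
  sequence = c("<{OV50}>", "<{OV148}>", "<{OV46},{OV197}>",
               "<{OV198},{OV199}, {OV205}>"),
  support = c(0.286, 0.121, 0.065, 0.065),
  stringsAsFactors = FALSE
)
SRC2 <- data.frame(
  sequenceID = c(42207993L, 42207993L, 42207993L, 57237976L, 57237976L),
  transactionID = c(1577L, 6048L, 1597L, 12423L, 12589L),
  eventID = c(1L, 2L, 3L, 1L, 2L),
  items = c("OV50", "OV11", "OV148", "OV46", "OV197"),
  stringsAsFactors = FALSE
)

# Split every pattern into its ordered items, e.g. c("OV46", "OV197")
patterns <- regmatches(RES1$sequence, gregexpr("[[:alnum:]]+", RES1$sequence))

# Hypothetical helper: TRUE if `pat` occurs inside `seen` in order (gaps allowed)
is_subsequence <- function(pat, seen) {
  pos <- 0L
  for (p in pat) {
    hits <- which(seen == p)
    hits <- hits[hits > pos]
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]
  }
  TRUE
}

# Hypothetical helper: Exist flags for the items of one sequenceID.
# Row i gets 1 if some pattern ends with items[i] and the rest of that
# pattern occurs, in order, among the rows before i.
exist_in_order <- function(items) {
  vapply(seq_along(items), function(i) {
    ok <- vapply(patterns, function(pat) {
      n <- length(pat)
      n > 0L && pat[n] == items[i] &&
        is_subsequence(pat[-n], items[seq_len(i - 1L)])
    }, logical(1))
    as.integer(any(ok))
  }, integer(1))
}

# Rows must be in event order within each sequenceID
SRC2 <- SRC2[order(SRC2$sequenceID, SRC2$eventID), ]
SRC2$Exist <- unsplit(
  lapply(split(SRC2$items, SRC2$sequenceID), exist_in_order),
  SRC2$sequenceID
)

# Share of 1s per sequenceID, kept only on the last row of each group
share <- ave(as.numeric(SRC2$Exist), SRC2$sequenceID,
             FUN = function(x) sum(x) / length(x))
is_last <- !duplicated(SRC2$sequenceID, fromLast = TRUE)
SRC2$Co <- ifelse(is_last, round(share, 2), NA_real_)
```

On the toy data this yields the same 'Exist' column as the question's OUT3, since OV46 does precede OV197 in sequenceID 57237976. The two approaches differ when the last item of a multi-item pattern appears without its predecessors: the last-item match still gives 1, while this sketch gives 0.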

Data

 SRC2 <- structure(list(sequenceID = c(42207993L, 42207993L, 42207993L, 
57237976L, 57237976L), transactionID = c(1577L, 6048L, 1597L, 
12423L, 12589L), eventID = c(1L, 2L, 3L, 1L, 2L), items = c("OV50", 
"OV11", "OV148", "OV46", "OV197")), .Names = c("sequenceID", 
"transactionID", "eventID", "items"), class = "data.frame",
 row.names = c("1", "2", "3", "4", "5"))

 RES1 <- structure(list(sequence = c("<{OV50}>", "<{OV148}>", 
 "<{OV46},{OV197}>", 
"<{OV198},{OV199}, {OV205}>"), support = c(0.286, 0.121, 0.065, 
0.065)), .Names = c("sequence", "support"), class = "data.frame", 
row.names = c("1", "2", "3", "4"))
