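For a data.table (or data.frame) in R, I would like to find all rows whose value in the column 'value' is a given distance 'distance' away from the value in another row with the same key. So, given the following: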
distance <- 22
key value
A 1
B 1
C 1
D 1
A 4
B 4
A 23
B 23
B 26
B 26
C 30
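I would like to annotate each row of the original table with a count of the rows that have the same key and a value exactly 22 greater: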
key value count
A 1 1
B 1 1
C 1 0
D 1 0
A 4 0
B 4 2
A 23 0
B 23 0
B 26 0
B 26 0
C 30 0
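I really don't know where to start with this kind of self-referential approach to manipulating data in R. My initial attempts involved creating a second table and trying to match against it, but that seems like an odd and clumsy way to do it.
Note: I am using the data.table package, but I am happy to work with a data.frame in this case if that makes things easier.
Reproducible example: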
require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))
Answer 0 (score: 5)
Here is a solution based on data.table. I would be interested to learn what improvements (if any) could be made to it.
# Your code
library(data.table)
source <-
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
value = c(1,1,1,1,4,4,23,23,26,26,30)))
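The odd data.table(data.frame(... is there because data.table() also has an argument named key. Wrapping a data.frame is one way to create a data.table with a column named "key". Capitalizing the column names avoids the clash with the argument name and allows the more standard syntax: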
source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
Value = c(1,1,1,1,4,4,23,23,26,26,30))
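Next, to avoid needing as.integer() later, we change the type of the Value column from numeric to integer now. Remember that in R 1 is numeric and 1L is integer. It is usually more efficient to store integer data as integer rather than as numeric. The next line is easier than typing lots of Ls above.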
source[,Value:=as.integer(Value)] # change type from `numeric` to `integer`
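As a quick check of the numeric versus integer point, here is a minimal base R sketch. The names x_num and x_int are made up for illustration and are not part of the original answer.

```r
# Plain literals such as 1 are stored as double ("numeric");
# the L suffix asks for integer storage instead.
x_num <- c(1, 4, 23)
x_int <- c(1L, 4L, 23L)

typeof(x_num)  # storage type of the plain literals
typeof(x_int)  # storage type of the L-suffixed literals

# Converting after the fact gives the same vector as typing the Ls.
identical(as.integer(x_num), x_int)

# A long integer vector takes less memory than a double vector
# of the same length.
object.size(integer(1e6)) < object.size(numeric(1e6))
```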
Now to continue:
distance <- 22L
setkey(source, Key, Value)
# Heart of the solution (following a few explanatory comments):
# "J()" : shorthand for 'data.table()'
# ".N" : returns the number of rows that matched a line (see ?data.table)
# "[[3]]" : as with simple data.frames, extracts the vector in column 3
# Note: data.table 1.9.4 and later need by=.EACHI to get one count per looked-up row
source[,count:=source[J(Key,Value+distance),.N,by=.EACHI][[3]]]
source
 Key Value count
[1,] A 1 1
[2,] A 4 0
[3,] A 23 0
[4,] B 1 1
[5,] B 4 2
[6,] B 23 0
[7,] B 26 0
[8,] B 26 0
[9,] C 1 0
[10,] C 30 0
[11,] D 1 0
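To see what the inner join returns before it is assigned to count, the one-liner can be split into steps. This is only a sketch, assuming data.table 1.9.4 or later (where by = .EACHI is available). The small table dt and the names lookup and matches are made up for illustration.

```r
library(data.table)

# Hypothetical miniature of the data, keyed the same way as in the answer.
dt <- data.table(Key   = c("A", "A", "B", "B", "B"),
                 Value = c(1L, 23L, 4L, 26L, 26L))
setkey(dt, Key, Value)
distance <- 22L

# One lookup row per original row: same Key, Value shifted by distance.
lookup <- dt[, .(Key, Value = Value + distance)]

# Keyed join: for each lookup row, .N is the number of rows of dt that
# match it on (Key, Value); lookup rows without a match get 0.
matches <- dt[lookup, .N, by = .EACHI]
matches

# Column 3 (named N) holds the counts, in the same order as dt.
dt[, count := matches$N]
dt
```

Here (A, 1) finds the single (A, 23) row and (B, 4) finds both (B, 26) rows, while the remaining rows find nothing.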
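Note that := changed source directly, by reference, and that is all there is to it. But setkey() also changed the order of the original data. If the original order needs to be preserved, then: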
source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]
# As above, by=.EACHI is needed on data.table 1.9.4 and later
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N,by=.EACHI][[3]]]
Key Value count
[1,] A 1 1
[2,] B 1 1
[3,] C 1 0
[4,] D 1 0
[5,] A 4 0
[6,] B 4 2
[7,] A 23 0
[8,] B 23 0
[9,] B 26 0
[10,] B 26 0
[11,] C 30 0
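Another way to keep the original row order is to skip setkey() altogether and name the join columns directly. The following is a sketch rather than part of the original answer: it assumes a data.table release recent enough to support the on argument, and the names dt and lookup are made up for illustration (dt also avoids masking base R's source() function).

```r
library(data.table)

dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = as.integer(c(1,1,1,1,4,4,23,23,26,26,30)))
distance <- 22L

# Lookup table in the original row order: same Key, Value shifted by distance.
lookup <- dt[, .(Key, Value = Value + distance)]

# Ad hoc join on the named columns; no key is set, so dt is never reordered.
# by = .EACHI returns one row per lookup row, with .N = number of matches.
dt[, count := dt[lookup, on = c("Key", "Value"), .N, by = .EACHI]$N]
dt
```

Because nothing is sorted, no copy() is needed and the counts line up with the rows as they were entered.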
Answer 1 (score: 1)
You can use mapply to loop over all combinations of key and value:
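# Count the rows with the same key whose value is exactly val + distance.
# Note: c() mixes character and numeric, so every column comes back as character.
data.table(t(mapply(function(key,val)
c(key=key,value=val,count=length(source$value[source$key==key & source$value==(val+distance)]) )
, as.character(source$key),source$value)))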
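Since the question allows a plain data.frame, the same row-by-row idea can be sketched in base R while keeping the column types intact. This is an illustration under assumptions, not the answerer's code: the name df is made up, and the loop compares every row against every other, so it is only suitable for small tables.

```r
df <- data.frame(key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 value = c(1,1,1,1,4,4,23,23,26,26,30))
distance <- 22

# For each row i, count the rows that share its key and whose value is
# exactly value[i] + distance. vapply() guarantees an integer result.
df$count <- vapply(seq_len(nrow(df)), function(i) {
  sum(df$key == df$key[i] & df$value == df$value[i] + distance)
}, integer(1))
df
```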