Find rows with a given difference between values in a column

Time: 2012-05-23 18:12:53

Tags: r dataframe data.table

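For a data.table (or data.frame) in R, I would like to find all rows whose value in the column 'value' lies a given distance 'distance' from the value in another row with the same key. So, given the following: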

distance <- 22
   key value
   A     1
   B     1
   C     1
   D     1
   A     4
   B     4
   A    23
   B    23
   B    26
   B    26
   C    30

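I would like to annotate the original table with, for each row, the number of rows that have the same key and a value exactly 22 greater: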

  key value count
  A     1     1
  B     1     1
  C     1     0
  D     1     0
  A     4     0
  B     4     2
  A    23     0
  B    23     0
  B    26     0
  B    26     0
  C    30     0

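I really don't know where to start with this kind of self-referential data manipulation in R. My initial attempts involved creating a second table and trying to match against it, but that seemed like an odd and poor approach.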

Note: I am using the data.table package, but I would be happy to work with a data.frame here if that makes things easier.

Reproducible example:

require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))

2 answers:

Answer 0: (score: 5)

Here is a data.table-based solution. I would be interested to learn what improvements (if any) could be made to it.

# Your code
library(data.table)
source <- 
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
                      value = c(1,1,1,1,4,4,23,23,26,26,30)))

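The odd-looking data.table(data.frame(... is there because data.table() also has an argument named key, so wrapping the data in data.frame() is one way to create a data.table with a column called "key". Capitalizing the column names avoids the argument-name conflict and allows the more standard syntax: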

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))

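Next, to avoid needing as.integer() later on, we change the type of the Value column from numeric to integer now. Remember that in R, 1 is numeric and 1L is integer. It is usually more efficient to store integer data as integer than as numeric. The next line is easier than typing lots of L suffixes above.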

source[,Value:=as.integer(Value)]   # change type from `numeric` to `integer`
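To make the numeric/integer distinction concrete, here is a small base R sketch. The vector `x` is only an illustration that mirrors the Value column; it is not part of the answer's code.

```r
# Literals without the L suffix are doubles ("numeric"); with L they are integers.
class(1)     # "numeric"
class(1L)    # "integer"

# Hypothetical vector mirroring the Value column above.
x <- c(1, 1, 1, 1, 4, 4, 23, 23, 26, 26, 30)
is.integer(x)               # FALSE: stored as double
is.integer(as.integer(x))   # TRUE after conversion
```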

Now, continuing:

distance <- 22L
setkey(source, Key, Value)

# Heart of the solution (following a few explanatory comments):
#  "J()"   : shorthand for 'data.table()'
#  ".N"    : returns the number of rows that matched a line (see ?data.table)
#  "[[3]]" : as with simple data.frames, extracts the vector in column 3

source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
      Key Value count
 [1,]   A     1     1
 [2,]   A     4     0
 [3,]   A    23     0
 [4,]   B     1     1
 [5,]   B     4     2
 [6,]   B    23     0
 [7,]   B    26     0
 [8,]   B    26     0
 [9,]   C     1     0
[10,]   C    30     0
[11,]   D     1     0

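Note that := changed source directly by reference, and that is all there is to it. However, setkey() also changed the order of the original data. If the original order needs to be preserved, then: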

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]   
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]

      Key Value count
 [1,]   A     1     1
 [2,]   B     1     1
 [3,]   C     1     0
 [4,]   D     1     0
 [5,]   A     4     0
 [6,]   B     4     2
 [7,]   A    23     0
 [8,]   B    23     0
 [9,]   B    26     0
[10,]   B    26     0
[11,]   C    30     0
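The joins above rely on the implicit per-row grouping ("by-without-by") that data.table performed when this answer was written. As far as I know, from data.table 1.9.4 onward that grouping has to be requested explicitly with by = .EACHI; otherwise .N is a single total rather than one count per row. A minimal sketch under that assumption follows. The names `dt` and `lookup` are mine rather than the answer's, and they also avoid masking base::source().

```r
library(data.table)

dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = as.integer(c(1,1,1,1,4,4,23,23,26,26,30)))
distance <- 22L

# Keyed copy used only as the lookup table, so dt keeps its original row order.
lookup <- setkey(copy(dt), Key, Value)

# For each row of dt, count the rows of lookup with the same Key and
# a Value equal to this row's Value + distance.
# by = .EACHI evaluates .N once per row of i, in the order of i.
dt[, count := lookup[dt[, list(Key, Value + distance)], .N, by = .EACHI]$N]
dt
```

Extracting the counts by name with $N, rather than by position with [[3]], keeps the line working if the number of key columns changes.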

Answer 1: (score: 1)

You can use mapply to loop over all combinations of key and value:

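# For each row, count the rows with the same key whose value equals val + distance.
# Note: c() mixes character and numeric, so all three columns come back as character.
data.table(t(mapply(function(key,val) 
      c(key=key,value=val,count=length(source$value[source$key==key & source$value==(val+distance)]) )
   , as.character(source$key),source$value)))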
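Because c() coerces the key, value, and count into a single character vector, the table built this way holds strings. Here is a sketch of a variant, assuming an exact match on value + distance is what is wanted, in which mapply returns only the counts and they are attached to a plain data.frame. The names `df` and `count_matches` are illustrative, not from the answer.

```r
df <- data.frame(key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 value = c(1,1,1,1,4,4,23,23,26,26,30),
                 stringsAsFactors = FALSE)
distance <- 22

# Number of rows sharing `key` whose value equals `val + distance`.
count_matches <- function(key, val) {
  sum(df$key == key & df$value == val + distance)
}

# mapply walks key and value in parallel; USE.NAMES = FALSE keeps a bare integer vector.
df$count <- mapply(count_matches, df$key, df$value, USE.NAMES = FALSE)
df
```

This scans the whole table once per row, so it is fine for small data but will likely be slower than a keyed join on large tables.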