Index unique values in data.table

Time: 2016-02-12 20:08:22

Tags: r data.table

I'm not sure how to put the question into words, but how can I create an index column for a data.table that, within each group, increments whenever a different value appears?

Here is the MWE

library(data.table)
in.data <- data.table(fruits=c(rep("banana", 4), rep("pear", 5)),vendor=c("a", "b", "b", "c", "d", "d", "e", "f", "f"))

Here is the result the R-code should generate

in.data[, wanted.column:=c(1,2,2,3,1,1,2,3,3)]

#    fruits vendor wanted.column
# 1: banana      a             1
# 2: banana      b             2
# 3: banana      b             2
# 4: banana      c             3
# 5:   pear      d             1
# 6:   pear      d             1
# 7:   pear      e             2
# 8:   pear      f             3
# 9:   pear      f             3

So it labels each vendor 1, 2, 3, ... within each fruit. There is probably a very simple solution, but I'm stuck.

3 answers:

Answer 0 (score: 9)

I have a few ideas. You can use a nested group counter:

in.data[, w := setDT(list(v = vendor))[, g := .GRP, by=v]$g, by=fruits]

Alternately, make a run ID, which depends on sorted data (thanks @eddi) and seems wasteful:

in.data[, w := rleid(vendor), by=fruits]

The base-R approach would probably be:

in.data[, w := match(vendor, unique(vendor)), by=fruits]
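As a quick check, the approaches in this answer can be run against the question's MWE and compared with the hand-written wanted.column. This is only a sketch: the column names w1, w2 and w3 are made up for illustration, match() on unique() is assumed here as the base-R variant, and rleid() needs a data.table version that ships it.

```r
library(data.table)

in.data <- data.table(
  fruits = c(rep("banana", 4), rep("pear", 5)),
  vendor = c("a", "b", "b", "c", "d", "d", "e", "f", "f")
)
expected <- c(1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 3L)

# w1: nested group counter (.GRP numbers the vendors inside each fruit)
in.data[, w1 := setDT(list(v = vendor))[, g := .GRP, by = v]$g, by = fruits]

# w2: run-length ID; only equivalent when equal vendors are adjacent
in.data[, w2 := rleid(vendor), by = fruits]

# w3: position of each vendor among the fruit's unique vendors (assumed base-R idiom)
in.data[, w3 := match(vendor, unique(vendor)), by = fruits]

# all three agree with the wanted column on this (sorted) input
stopifnot(
  identical(in.data$w1, expected),
  identical(in.data$w2, expected),
  identical(in.data$w3, expected)
)
```

On the MWE the three columns coincide because every vendor's rows sit next to each other; with unsorted vendors, w2 would differ from the other two.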

Answer 1 (score: 9)

Another approach might be two steps: first number every (fruits, vendor) group with .GRP, then subtract the first value within each fruit so that the numbering restarts at 1. The way I would comment this in production code might be:

DT = data.table(fruits=c(rep("banana", 4), rep("pear", 5)),vendor=c("a", "b", "b", "c", "d", "d", "e", "f", "f"))
DT
   fruits vendor
1: banana      a
2: banana      b
3: banana      b
4: banana      c
5:   pear      d
6:   pear      d
7:   pear      e
8:   pear      f
9:   pear      f
DT[, wanted:=.GRP, by="fruits,vendor"]  # step 1
DT
   fruits vendor wanted
1: banana      a      1
2: banana      b      2
3: banana      b      2
4: banana      c      3
5:   pear      d      4
6:   pear      d      4
7:   pear      e      5
8:   pear      f      6
9:   pear      f      6
DT[, wanted:=wanted-wanted[1]+1L, by="fruits"]  # step 2 (adjust)
DT
   fruits vendor wanted
1: banana      a      1
2: banana      b      2
3: banana      b      2
4: banana      c      3
5:   pear      d      1
6:   pear      d      1
7:   pear      e      2
8:   pear      f      3
9:   pear      f      3
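The two steps can also be wrapped in a small function and checked against the expected result. This is a sketch rather than part of the answer: the helper name add_group_index is hypothetical, and it assumes the rows of each fruit are contiguous, since step 2 relies on a fruit's (fruits, vendor) groups receiving consecutive .GRP numbers.

```r
library(data.table)

# Hypothetical helper; works on a copy so the input table is left untouched.
add_group_index <- function(dt) {
  out <- copy(dt)
  # step 1: running counter over (fruits, vendor) groups, in order of first appearance
  out[, wanted := .GRP, by = "fruits,vendor"]
  # step 2: shift each fruit's counter so that it starts again at 1
  out[, wanted := wanted - wanted[1] + 1L, by = "fruits"]
  out[]
}

DT <- data.table(
  fruits = c(rep("banana", 4), rep("pear", 5)),
  vendor = c("a", "b", "b", "c", "d", "d", "e", "f", "f")
)
res <- add_group_index(DT)

# matches the column requested in the question
stopifnot(identical(res$wanted, c(1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 3L)))
```

Working on a copy is a design choice for the sketch only; the answer's own code updates DT by reference, which avoids the extra allocation.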

Answer 2 (score: 4)

If you want the index to be the same for all rows of the same vendor within a given fruit, then here's another option:

in.data[, wanted := as.integer(factor(vendor, levels = unique(vendor))), by = fruits]

Otherwise, if you want it to tick every time the vendor changes, then, out of the answers given so far, rleid is the only one that works.
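The difference between the two readings only shows up when a vendor reappears after another one within the same fruit. A minimal sketch, using a made-up four-row table and the illustrative column names by.vendor and by.change:

```r
library(data.table)

# vendor "a" comes back after "b" within the same fruit
dt <- data.table(fruits = rep("banana", 4), vendor = c("a", "b", "a", "c"))

# same vendor -> same index within the fruit
dt[, by.vendor := as.integer(factor(vendor, levels = unique(vendor))), by = fruits]

# index ticks up on every change of vendor
dt[, by.change := rleid(vendor), by = fruits]

dt$by.vendor  # 1 2 1 3
dt$by.change  # 1 2 3 4
```

On the question's MWE both columns would be identical, because each vendor's rows are adjacent there.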