Getting the indices of all unique elements

Asked: 2015-10-07 21:01:18

Tags: r text-mining data-processing

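I have a dataset with 500,000 entries. Each entry has a userId and a productId. I want to get all the userIds that correspond to each distinct productId. The list is huge, though, and neither of the approaches below works for me because both are very slow. Is there a faster solution?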

Using lapply (problem: it scans the entire rpid vector once for every element of uniqPids):

orderedIndx <- lapply(uniqPids, function(x){
    which(rpid %in% x)
})
names(orderedIndx) <- uniqPids
#Looking for indices with each unique productIds

Using a for loop:

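  orderedIndx <- list()
  for (j in seq_along(rpid)) {
    key <- rpid[j]
    # Use [[ ]] to reach the stored vector; [ ] returns a length-1 sub-list,
    # so the running length was always 1 and earlier indices were lost.
    orderedIndx[[key]] <- c(orderedIndx[[key]], j)
  }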

Sample data:

ruid[1:10]
# [1] "a3sgxh7auhu8gw" "a1d87f6zcve5nk" "abxlmwjixxain"  "a395borc6fgvxv" "a1uqrsclf8gw1t" "adt0srk1mgoeu" 
# [7] "a1sp2kvkfxxru1" "a3jrgqveqn31iq" "a1mzyo9tzk0bbi" "a21bt40vzccyt4"

rpid[1:10]
# [1] "b001e4kfg0" "b001e4kfg0" "b000lqoch0" "b000ua0qiq" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k"
# [9] "b000e7l2r4" "b00171apva"

The output should be:

b001e4kfg0 -> a3sgxh7auhu8gw, a1d87f6zcve5nk
b000lqoch0 -> abxlmwjixxain
b000ua0qiq -> a395borc6fgvxv
b006k2zz7k -> a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
b000e7l2r4 -> a1mzyo9tzk0bbi
b00171apva -> a21bt40vzccyt4

3 answers:

Answer 0: (score: 2)

It seems you are just looking for split:

split(seq_along(rpid), rpid)
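
Note that the call above returns row indices grouped by product ID. If the user IDs themselves are wanted, as in the expected output in the question, the same idea should work with `ruid` as the first argument. A minimal sketch using the ten sample values from the question; the name `usersByProduct` and the `factor(..., levels = unique(rpid))` wrapper are illustrative additions, not part of the original answer:

```r
# Sample data copied from the question (first ten entries).
ruid <- c("a3sgxh7auhu8gw", "a1d87f6zcve5nk", "abxlmwjixxain", "a395borc6fgvxv",
          "a1uqrsclf8gw1t", "adt0srk1mgoeu", "a1sp2kvkfxxru1", "a3jrgqveqn31iq",
          "a1mzyo9tzk0bbi", "a21bt40vzccyt4")
rpid <- c("b001e4kfg0", "b001e4kfg0", "b000lqoch0", "b000ua0qiq", "b006k2zz7k",
          "b006k2zz7k", "b006k2zz7k", "b006k2zz7k", "b000e7l2r4", "b00171apva")

# split() groups the first argument by the values of the second, so rpid is
# not rescanned once per product as in the lapply version.
# Setting the factor levels to unique(rpid) keeps first-appearance order;
# plain split(ruid, rpid) would return the groups sorted by product ID.
usersByProduct <- split(ruid, factor(rpid, levels = unique(rpid)))

# User IDs recorded for one product.
usersByProduct[["b006k2zz7k"]]
```

The result is a named list, so a single product can be looked up by name as in the last line.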

Answer 1: (score: 1)

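I am not entirely sure what kind of output you want, or how many rows your dataset has, but here are three versions so you can pick the one you prefer. The first uses dplyr with character variables; I expect it to be slow if you have millions of rows. The second uses dplyr with factor variables; I expect it to be faster than the first. The third uses data.table; I expect it to be as fast as the second, or faster.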

library(dplyr)

ruid = 
c("a3sgxh7auhu8gw", "a1d87f6zcve5nk", "abxlmwjixxain",  "a395borc6fgvxv",
  "a1uqrsclf8gw1t", "adt0srk1mgoeu", "a1sp2kvkfxxru1", "a3jrgqveqn31iq",
  "a1mzyo9tzk0bbi", "a21bt40vzccyt4")

rpid =
c("b001e4kfg0", "b001e4kfg0", "b000lqoch0", "b000ua0qiq", "b006k2zz7k",
  "b006k2zz7k", "b006k2zz7k", "b006k2zz7k", "b000e7l2r4", "b00171apva")

### using dplyr and character values
dt = data.frame(rpid, ruid, stringsAsFactors = FALSE)

dt %>%
  group_by(rpid) %>%
  do(data.frame(list_ruids = paste(c(.$ruid), collapse=", "))) %>%
  ungroup

#         rpid                                                    list_ruids
#        (chr)                                                         (chr)
# 1 b000e7l2r4                                                a1mzyo9tzk0bbi
# 2 b000lqoch0                                                 abxlmwjixxain
# 3 b000ua0qiq                                                a395borc6fgvxv
# 4 b00171apva                                                a21bt40vzccyt4
# 5 b001e4kfg0                                a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq


# ----------------------------------

### using dplyr and factor values
dt = data.frame(rpid, ruid, stringsAsFactors = TRUE)

dt %>%
  group_by(rpid) %>%
  do(data.frame(list_ruids = paste(c(levels(dt$ruid)[.$ruid]), collapse=", "))) %>%
  ungroup

#         rpid                                                    list_ruids
#       (fctr)                                                         (chr)
# 1 b000e7l2r4                                                a1mzyo9tzk0bbi
# 2 b000lqoch0                                                 abxlmwjixxain
# 3 b000ua0qiq                                                a395borc6fgvxv
# 4 b00171apva                                                a21bt40vzccyt4
# 5 b001e4kfg0                                a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq


# -------------------------------------

library(data.table)

### using data.table
dt = data.table(rpid, ruid)

dt[, list(list_ruids = paste(c(ruid), collapse=", ")), by = rpid]

#          rpid                                                    list_ruids
# 1: b001e4kfg0                                a3sgxh7auhu8gw, a1d87f6zcve5nk
# 2: b000lqoch0                                                 abxlmwjixxain
# 3: b000ua0qiq                                                a395borc6fgvxv
# 4: b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# 5: b000e7l2r4                                                a1mzyo9tzk0bbi
# 6: b00171apva                                                a21bt40vzccyt4
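
For comparison, the same comma-separated result can be sketched in base R, without either package, by combining `split` with `paste`. This is an illustrative alternative rather than part of the original answer; the names `groups`, `collapsed`, and `result` are hypothetical, and no claim is made about how its speed compares with the three versions above.

```r
# Sample data copied from the question (first ten entries).
ruid <- c("a3sgxh7auhu8gw", "a1d87f6zcve5nk", "abxlmwjixxain", "a395borc6fgvxv",
          "a1uqrsclf8gw1t", "adt0srk1mgoeu", "a1sp2kvkfxxru1", "a3jrgqveqn31iq",
          "a1mzyo9tzk0bbi", "a21bt40vzccyt4")
rpid <- c("b001e4kfg0", "b001e4kfg0", "b000lqoch0", "b000ua0qiq", "b006k2zz7k",
          "b006k2zz7k", "b006k2zz7k", "b006k2zz7k", "b000e7l2r4", "b00171apva")

# Group user IDs by product ID; groups come back sorted by product ID.
groups <- split(ruid, rpid)

# Collapse each group into one comma-separated string.
# vapply() checks that every group yields exactly one character value.
collapsed <- vapply(groups, paste, character(1), collapse = ", ")

# Arrange as a two-column data frame, mirroring the layout printed above.
result <- data.frame(rpid = names(collapsed),
                     list_ruids = unname(collapsed),
                     stringsAsFactors = FALSE)
result
```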

Answer 2: (score: 0)

Do you have tidy data in a data frame? Then you can do this.
