How can I quickly group a large number of elements?

Time: 2019-03-08 19:36:22

Tags: r for-loop grouping

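I have created some "data" here as an example. It contains 100 elements, each starting with a letter followed by 3 random digits.

I would like to know the best/fastest way to convert these into groups, along the lines of what I started doing in the for loop.

Suppose I need to create 50 groups, and that "data" has not 100 elements but a million.

The grouping itself will be fairly arbitrary. In the example I used A000-A599 and A600-A999 as the first two groups, but the intervals between groups are not regular: B000-B599 and B600-B999 are not necessarily the next groups. The next groups could be, for example, B000-C299, C300-C799, C800-D499, and so on. I need to enter these groups manually.

I suspect a for loop is not the best way to do this, because the loop takes a long time to finish.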

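library(stringr)
library(magicfor)

# 100 sample elements: one random letter followed by three random digits
data <- paste0(sample(LETTERS, 100, replace = TRUE),
               sample(str_pad(0:999, width = 3, side = "left", pad = "0"), 100, replace = TRUE))

magic_for()  # record the values passed to put() on each iteration

for (x in seq_along(data)) {
  if (grepl("A[0-5]", data[x])) {
    range <- "A000-A599"
  } else if (grepl("A[6-9]", data[x])) {
    range <- "A600-A999"
  } else {
    range <- NA_character_  # no group defined yet for this element
  }

  put(range)
}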

4 Answers:

Answer 0: (score: 3)

You could try the following:


> library(dplyr)
> # Rounding function: round x up to the next multiple of `to`
> roundUp <- function(x, to = 10) {
+   to * (x %/% to + as.logical(x %% to))
+ }
> # Create a data frame for easy storage
> df <- data.frame(data = data, stringsAsFactors = FALSE)
> df %>%
+   mutate(C = substr(data, 1, 1),
+          N = as.integer(substr(data, 2, 4))) %>%
+   mutate(N = roundUp(N, to = 500)) %>%
+   mutate(data2 = paste0(C, N)) %>%
+   select(data, data2)
   data data2
1  U493  U500
2  A429  A500
3  N564 N1000
4  W656 W1000
5  J978 J1000
6  B232  B500
7  D240  D500
8  I796 I1000
9  E831 E1000
...(truncated)

The data2 column contains the new groups.
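To see which buckets this rounding actually produces, here is a small check of roundUp on a few boundary values. The input vector is made up for illustration; the function body is the one from the answer.

```r
# Same rounding function as in the answer above
roundUp <- function(x, to = 10) {
  to * (x %/% to + as.logical(x %% to))
}

# Hand-picked boundary values (illustrative, not from the question's data)
rounded <- roundUp(c(0, 1, 493, 500, 564, 999), to = 500)
print(rounded)
# [1]    0  500  500  500 1000 1000
```

So with to = 500 the buckets are 0, 1-500 and 501-999. That is regular rounding rather than the hand-picked 000-599 / 600-999 split from the question, so this approach fits only when the group boundaries fall on multiples of a fixed step.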

Answer 1: (score: 3)

Using the tidyverse package, I would convert your data vector to a tibble (or data.frame) format.

Once you have got that far, it is easy to group the data however you need. Your loop could be implemented like this:

library(tidyverse)

df <- tibble(my_variable = data) %>%
  mutate(
    first_char = substr(my_variable, 1, 1),
    random_numbers = substr(my_variable, 2, 4)
  )
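The code above only splits each element into its letter and its digits. As a rough sketch of the grouping step itself (this is not part of the original answer), the if/else chain from the question could be written with dplyr's case_when(). The column name range, the short sample vector, and the NA fallback are assumptions for illustration; only the two groups given in the question are filled in.

```r
library(dplyr)

# Small hand-written sample standing in for the question's `data` vector
data <- c("A123", "A600", "A599", "B010")

df <- tibble(my_variable = data) %>%
  mutate(
    # case_when() is vectorized: all rows are tested in one call, no loop
    range = case_when(
      grepl("^A[0-5]", my_variable) ~ "A000-A599",
      grepl("^A[6-9]", my_variable) ~ "A600-A999",
      TRUE ~ NA_character_  # assumption: elements with no group yet become NA
    )
  )

print(df$range)
# [1] "A000-A599" "A600-A999" "A000-A599" NA
```

Each additional group would need its own condition, so for 50 irregular groups a lookup table of boundaries is likely easier to maintain than 50 case_when() branches.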

I recommend reading the (free) e-book below. It covers a range of useful tools for everyday R tasks like the one you describe:

https://r4ds.had.co.nz/index.html

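Answer 2: (score: 2)

"The grouping itself will be fairly arbitrary. In the example I used A000-A599 and A600-A999 as the first two groups, but the intervals between groups are not regular: B000-B599 and B600-B999 are not necessarily the next groups. The next groups could be B000-C299, C300-C799, C800-D499, and so on."

Since your groups are in lexicographic order, you can use a rolling join. In that case you only need to specify the lower bound of each group: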

library(data.table)

# define decrement function
dec = function(x){
  ltr = substr(x, 1, 1)
  num = as.integer(substr(x, 2, 4))

  w0 = num == 0L
  ltr = replace(ltr, w0, LETTERS[match(ltr[w0], LETTERS) - 1L])
  num = replace(num - 1L, w0, 999L)

  sprintf("%s%03d", ltr, num)
}

# enumerate lower bounds and derive ranges
rangeDT = data.table(lb = c("A000", "A600", "B000", "C300", "C800"))
# each upper bound is one below the next lower bound; the last group ends at Z999
rangeDT[, ub := c(dec(lb[-1L]), "Z999")]
rangeDT[, range := sprintf("%s-%s", lb, ub)]

#      lb   ub     range
# 1: A000 A599 A000-A599
# 2: A600 A999 A600-A999
# 3: B000 C299 B000-C299
# 4: C300 C799 C300-C799
# 5: C800 Z999 C800-Z999

The rolling update join is then...

DT = data.table(x = data)    
DT[, range := rangeDT[.SD, on=.(lb = x), roll=TRUE, x.range]]

The result looks like

> head(DT)
      x     range
1: C965 C800-Z999
2: Q973 C800-Z999
3: V916 C800-Z999
4: C701 C300-C799
5: A363 A000-A599
6: F144 C800-Z999

If your data were numeric, base R's cut or findInterval would work, but for whatever reason neither of them supports strings.
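One possible workaround, sketched here under the assumption that every element is exactly one uppercase letter followed by three digits, is to map each string to an integer that preserves the lexicographic order and then call findInterval on those integers. The helper to_key and the sample vector x are hypothetical names introduced for this illustration; the lower bounds and labels are the ones used above.

```r
# Hypothetical helper: map "A000".."Z999" to the integers 0..25999
to_key <- function(x) {
  (match(substr(x, 1, 1), LETTERS) - 1L) * 1000L + as.integer(substr(x, 2, 4))
}

lb <- c("A000", "A600", "B000", "C300", "C800")  # lower bounds, already sorted
labels <- c("A000-A599", "A600-A999", "B000-C299", "C300-C799", "C800-Z999")

# Hand-written sample elements (illustrative only)
x <- c("A363", "A600", "B999", "C299", "C300", "F144")

# findInterval() returns, for each key, the index of the last lower bound <= key
grp <- labels[findInterval(to_key(x), to_key(lb))]
print(grp)
# [1] "A000-A599" "A600-A999" "B000-C299" "B000-C299" "C300-C799" "C800-Z999"
```

Because findInterval is vectorized and uses a binary search over the sorted bounds, this should scale to a million elements and 50 groups without an explicit loop.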

Answer 3: (score: 1)

How about this?

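library(data.table)
library(stringr)  # for str_sub()

# one regex per letter and half: "A[0-5]", ..., "Z[0-5]", "A[6-9]", ..., "Z[6-9]"
ranges <- c(paste0(LETTERS, "[0-5]"), paste0(LETTERS, "[6-9]"))

final <- lapply(ranges, function(y) {
  matches <- grepl(y, data)
  if (!any(matches)) return(NULL)
  # build the label from the regex, e.g. "A[0-5]" -> "A000-A599"
  data.table(
    element = data[matches],
    range = paste0(str_sub(y, 1, 1), str_sub(y, 3, 3), "00-",
                   str_sub(y, 1, 1), str_sub(y, 5, 5), "99")
  )
})
final_2 <- rbindlist(final)  # NULL entries are skipped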


#    element   range
#      A374 A000-A599
#      B498 B000-B599
#      B064 B000-B599
#      C131 C000-C599
#      C460 C000-C599
#      C099 C000-C599


structure(list(element = c("A374", "B498", "B064", "C131", "C460",  "C099", "C193", "E428", "E108", "E527", "E138", "E375", "E312",  "F046", "F417", "F094", "G142", "G461", "G068", "H372", "H523",  "H027", "H506", "I470", "I169", "I050", "I495", "I405", "J298",  "K165", "K169", "K131", "L510", "L210", "L277", "N257", "N554",  "N452", "N484", "N247", "N373", "N492", "O347", "O221", "O176",  "P578", "P477", "Q062", "Q257", "Q083", "R306", "S415", "S154",  "S226", "S400", "T132", "T181", "T321", "V109", "V118", "V267",  "W381", "W047", "X317", "X192", "Y390", "Y132", "Y327", "Y141",  "Y353", "Z429", "C981", "D813", "F934", "G910", "G673", "G664",  "I754", "I624", "L603", "N991", "N996", "O689", "O932", "P854",  "P689", "P761", "P681", "Q631", "S620", "T923", "T841", "U787",  "U929", "W942", "W702", "X770", "X880", "Y719", "Y969"), range = c("A000-A599",  "B000-B599", "B000-B599", "C000-C599", "C000-C599", "C000-C599",  "C000-C599", "E000-E599", "E000-E599", "E000-E599", "E000-E599",  "E000-E599", "E000-E599", "F000-F599", "F000-F599", "F000-F599",  "G000-G599", "G000-G599", "G000-G599", "H000-H599", "H000-H599",  "H000-H599", "H000-H599", "I000-I599", "I000-I599", "I000-I599",  "I000-I599", "I000-I599", "J000-J599", "K000-K599", "K000-K599",  "K000-K599", "L000-L599", "L000-L599", "L000-L599", "N000-N599",  "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599",  "N000-N599", "O000-O599", "O000-O599", "O000-O599", "P000-P599",  "P000-P599", "Q000-Q599", "Q000-Q599", "Q000-Q599", "R000-R599",  "S000-S599", "S000-S599", "S000-S599", "S000-S599", "T000-T599",  "T000-T599", "T000-T599", "V000-V599", "V000-V599", "V000-V599",  "W000-W599", "W000-W599", "X000-X599", "X000-X599", "Y000-Y599",  "Y000-Y599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Z000-Z599",  "C600-C999", "D600-D999", "F600-F999", "G600-G999", "G600-G999",  "G600-G999", "I600-I999", "I600-I999", "L600-L999", "N600-N999",  "N600-N999", "O600-O999", "O600-O999", "P600-P999", "P600-P999",  "P600-P999", 
"P600-P999", "Q600-Q999", "S600-S999", "T600-T999",  "T600-T999", "U600-U999", "U600-U999", "W600-W999", "W600-W999",  "X600-X999", "X600-X999", "Y600-Y999", "Y600-Y999")), row.names = c(NA, 
-100L), class = c("data.table", "data.frame"))