Question

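Edit: I eventually got my for() loop working by avoiding "in seq_along()" and using the more familiar "1:(nrow(df))". Also (crucially), I made it much more efficient by inserting a break statement in the if() body: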

for (i in 1:(nrow(urwiki))){
  for(j in 1:(nrow(unique_names))){
    if(identical(unique_names[[j, "editor"]], urwiki[[i,"editor"]]) ){
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
      break
    }
  }

}

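However, it still takes over an hour (760,000 rows * 11,000 possible matches = roughly 8 billion comparisons in the worst case), so I would be grateful if anyone could tell me how to "vectorise" this operation in future.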
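For reference, one common way to vectorise this kind of lookup in base R is match(), which returns, for each element of its first argument, the position of its first match in the second argument. A minimal sketch on toy data (the `_toy` objects and their values are made up for illustration and are not the real dataset):

```r
# Toy stand-ins for urwiki and unique_names (hypothetical values).
urwiki_toy <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  stringsAsFactors = FALSE
)
unique_names_toy <- data.frame(
  cohort = c("2004", "2004", "2005", "2005", "2006"),
  editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry"),
  stringsAsFactors = FALSE
)

# For every row of urwiki_toy, find the row index of that editor
# in unique_names_toy (NA if the editor is not listed there).
idx <- match(urwiki_toy$editor, unique_names_toy$editor)

# Index the cohort column with those positions: one cohort per original row.
cohort_vector <- unique_names_toy$cohort[idx]
urwiki_toy$cohort <- cohort_vector
```

This replaces both loops with a single lookup per row. Editors missing from the lookup table come back as NA rather than 0, which may or may not be the behaviour wanted.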

Original question follows...

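I want to create a vector of categories based on a dataframe/tibble. The "levels" in question are strings, so maybe there's a better way to do this, but I've had no luck with a for() loop, and I've read that it's generally better to use vectorised methods in R. I tried the lapply() method found here: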

Applying the same factor levels to multiple variables in an R data frame

urwiki["editor"] <- lapply(urwiki["editor"], factor, 
           levels = unique_names$editor, 
           labels = unique_names$cohort)

The error I get says that the labels vector I supplied is one value too long:

Error in FUN(X[[i]], ...) : 
invalid 'labels'; length 12863 should be 1 or 12862

Both the levels and the labels inputs come from the same data frame, which has 12863 rows, so why does it expect a vector that is one shorter?
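One possible explanation, offered as a guess since the full data isn't shown: factor() removes any levels that match its exclude argument (NA by default) before it compares the lengths of levels and labels. If unique_names$editor contained a single NA, the levels would shrink to 12862 while the labels stayed at 12863. A small sketch with made-up values that triggers the same kind of error:

```r
# Hypothetical lookup vectors; the NA stands in for a possible missing editor.
editors <- c("Jim", "Steve", NA)
cohorts <- c("2004", "2005", "2006")

# factor() drops the NA level (exclude = NA by default), leaving 2 levels
# but 3 labels, so the call stops with an error about 'labels'.
msg <- tryCatch(
  {
    factor(c("Steve", "Jim"), levels = editors, labels = cohorts)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Counting missing values in the lookup column would confirm or rule this out.
n_missing <- sum(is.na(editors))
```

On the real data, the equivalent check would be sum(is.na(unique_names$editor)).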

I also tried this with the purrr package:

cohort_vector <- map_int(urwiki$editor, factor,
                     levels = unique_names$editor, 
                     labels = unique_names$cohort)

with the corresponding error:

Error in .f(.x[[i]], ...) : 
invalid 'labels'; length 12863 should be 1 or 12862

The tibble:

urwiki <- structure(list(articleid = c("4", "4", "4", "4", "4", "4"), 
date_time = c("1/27/2004 17:36", 
"2/20/2004 13:40", "3/3/2004 18:31", "3/3/2004 18:47", "3/3/2004 18:55", 
"3/3/2004 19:01"), editor = c("Steve", "Jim", 
"Terry", "Steve", "Rachel", "Harvey"
), year = c("2004", "2004", "2004", "2004", "2004", "2004")), .Names = 
c("articleid", 
"date_time", "editor", "year"), row.names = c(NA, -6L), class = 
c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "year", drop = TRUE, indices = list(
    0:5), group_sizes = 6L, biggest_group_size = 6L, labels = 
structure(list(
    year = "2004"), row.names = c(NA, -1L), class = "data.frame", vars = 
"year", drop = TRUE, .Names = "year"))

The tibble looks like this:

 anon articleid       date_time deleted          editor
 <lgl>     <int>           <chr>   <lgl>           <chr>
 TRUE         4 1/27/2004 17:36   FALSE           Steve
 TRUE         4 2/20/2004 13:40   FALSE             Jim
 TRUE         4  3/3/2004 18:31   FALSE           Terry
 TRUE         4  3/3/2004 18:47   FALSE           Steve
 TRUE         4  3/3/2004 18:55   FALSE          Rachel

I made a separate object that identifies each unique editor and the year in which they first appeared:

unique_names <- structure(list(cohort = c("2004", "2004", "2004", "2004", 
"2004", "2004"), editor = c("Jim", "Steve", "Harvey", "Rachel", "Terry", 
"139.164.251.34"), n = c(65L, 2L, 1L, 1L, 1L, 9L)), 
.Names = c("cohort", "editor", 
 "n"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), vars = c("cohort", "editor"), drop = TRUE, indices = 
list(
    0L, 1L, 2L, 3L, 4L, 5L), group_sizes = c(1L, 1L, 1L, 1L, 
1L, 1L), biggest_group_size = 1L, labels = structure(list(cohort = c("2004", 

"2004", "2004", "2004", "2004", "2004"), editor = c("Jim", 
"Steve", "Harvey", 
"Rachel", "Terry", "139.164.251.34")), row.names = c(NA, 
-6L), class = "data.frame", vars = c("cohort", "editor"), drop = TRUE, 
.Names = c("cohort", 
"editor")))

which looks like:

cohort                                           editor
<chr>                                            <chr>
2004                                            Jim
2004                                            Steve
2004                                            Harvey

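So I'm trying to make a vector the length of the original set that identifies each editor by their cohort. I can then add that vector to the original tibble, associating each row with the editor's cohort rather than just the year in which the row was created. In this example, the vector would simply be six "2004"s.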

When I run the map_int function on the head() data above, it doesn't give me an error, but it doesn't return the vector I need either.

The for() loop I mentioned earlier looked like this:

cohort_vector <- vector("integer", nrow(urwiki2))
for (i in seq_along(urwiki2)){
  for(j in seq_along(unique_names)){
    if(identical(unique_names[[j, "editor"]], urwiki2[[i,"editor"]]) ){
      cohort_vector[[i]] <- unique_names[[j, "cohort"]]
    }
  }

}

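This for loop works on the sample data above, but not for repeated names, i.e. the second "Steve" won't match and the value comes back as 0. But when I run it on my actual dataset (700,000+ rows), I end up with a vector of 700,000 zeros.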
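A possible reason for the zeros, stated as an assumption about the loop above: a data frame is a list of columns, so seq_along() applied to it counts columns, not rows. The loop would then only visit as many values of i as urwiki2 has columns. A minimal sketch on a made-up data frame:

```r
# Hypothetical 6-row, 2-column data frame.
df_toy <- data.frame(
  editor = c("Steve", "Jim", "Terry", "Steve", "Rachel", "Harvey"),
  year   = rep("2004", 6),
  stringsAsFactors = FALSE
)

# seq_along() on a data frame indexes its columns.
cols_idx <- seq_along(df_toy)

# seq_len(nrow()) indexes its rows, which is what a row-wise loop needs.
rows_idx <- seq_len(nrow(df_toy))

print(cols_idx)
print(rows_idx)
```

This would be consistent with the loop starting to work once "1:(nrow(df))" was used instead.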

Trying to apply a vector of factors in R, error keeps expecting the vector length to be one less
