Question

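I am trying to transform LDA prediction results. The result is a list object containing hundreds of lists, each holding the topics (as numeric values) assigned to every token in a document, as in the following example: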

assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)

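Each list inside the list object has a different length, corresponding to the length of each tokenized document.

What I want to do is 1) get the most frequent topic (1, 2, 3) from each list, and 2) convert the result to tbl or data.frame format, like this: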

document  topic   freq
   1        1       6
   2        2       5
   3        3       6

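That way I can use inner_join() to merge these "consensus" predictions with the topic assignments produced by tm or topicmodels, and compare their accuracy and so on. Because assignments is a list, I cannot directly apply a data-frame function to get the most frequent topic in each list. I tried top_n(), but it did not give me what I wanted.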

Answer 1

You can iterate over the list with sapply, use table to get the frequencies, and extract the first value from the sorted result:

result <- sapply(assignments, function(x) sort(table(x), decreasing = TRUE)[1])
data.frame(document = seq_along(assignments),
           topic = as.integer(names(result)),
           freq = result)
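
A variation on the same idea, as a minimal sketch: which.max(table(x)) picks the most frequent topic without sorting, and vapply fixes the return type. The helper name top_topic is made up for illustration and is not part of any package. When two topics are tied, which.max returns the first maximum, so the smallest topic number wins.

```r
assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)

# top_topic is a hypothetical helper: it returns the most frequent
# value of x together with its count.
top_topic <- function(x) {
  counts <- table(x)          # frequency of each topic, ordered by topic value
  best <- which.max(counts)   # position of the first maximum
  c(topic = as.integer(names(counts)[best]),
    freq  = as.integer(counts[best]))
}

# vapply enforces two integers per document; res is a 2 x n matrix
res <- vapply(assignments, top_topic, c(topic = 0L, freq = 0L))

out <- data.frame(
  document = seq_along(assignments),
  topic    = res["topic", ],
  freq     = res["freq", ]
)
out
```

Here topic and freq both come back as integers, so out can be joined on document or topic without a type conversion.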

Answer 2

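We can loop over the list, use tabulate to get the frequency of each element, find the index of the maximum, extract that topic together with its frequency as a data.frame, and rbind the resulting list elements:

do.call(rbind, lapply(seq_along(assignments), function(i) {
  x  <- assignments[[i]]
  ux <- unique(x)                # distinct topics in document i
  i1 <- tabulate(match(x, ux))   # count of each distinct topic
  data.frame(document = i, topic = ux[which.max(i1)], freq = max(i1))
}))
#   document topic freq
# 1        1     1    6
# 2        2     2    5
# 3        3     3    6

Another option is to convert the list to a two-column dataset and then group by document to find the rows with the maximum frequency:

library(data.table)
setDT(stack(setNames(assignments, seq_along(assignments))))[,
    .(freq = .N), .(document = ind, topic = values)][,
    .SD[freq == max(freq)], document]
#    document topic freq
# 1:        1     1    6
# 2:        2     2    5
# 3:        3     3    6

Or we could use the tidyverse.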
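
For the tidyverse route, one possible dplyr version is sketched here as an illustration; it is not code taken from the answer. It assumes dplyr 1.0.0 or later (for slice_max) and flattens the list into one row per token before counting. The name result is arbitrary.

```r
library(dplyr)

assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)

result <- tibble(
  # repeat each document index once per token in that document
  document = rep(seq_along(assignments), lengths(assignments)),
  topic    = unlist(assignments)
) %>%
  count(document, topic, name = "freq") %>%         # one row per document/topic pair
  group_by(document) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%     # keep one top topic per document
  ungroup()

result
```

Setting with_ties = FALSE keeps exactly one row per document; dropping it would return every tied topic instead.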

Answer 3

Using purrr::imap_dfr:

library(tidyverse)
imap_dfr(assignments,~ tibble(
  document = .y,
  Topic = names(which.max(table(.x))),
  freq  = max(tabulate(.x))))

# # A tibble: 3 x 3
#   document Topic  freq
#      <int> <chr> <int>
# 1        1     1     6
# 2        2     2     5
# 3        3     3     6
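
Since the goal in the question is to compare these results against another set of assignments with inner_join(), here is a rough sketch of that step. The predicted table and its values are invented for illustration, and the consensus table is typed in by hand with topic stored as an integer (the Topic column above is character and would need as.integer() first).

```r
library(dplyr)

# Most frequent topic per document, entered by hand for this sketch
consensus <- tibble(
  document = 1:3,
  topic    = c(1L, 2L, 3L),
  freq     = c(6L, 5L, 6L)
)

# Hypothetical predictions from another model; the values are made up
predicted <- tibble(
  document        = 1:3,
  predicted_topic = c(1L, 3L, 3L)
)

compared <- inner_join(consensus, predicted, by = "document") %>%
  mutate(agree = topic == predicted_topic)   # TRUE where both assignments match

# Share of documents on which the two assignments agree
mean(compared$agree)
```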

Get the most frequent value from a list of lists
