Question

I have a list of words that looks like this:

[[1]]
[1] "quick" "brown" "fox"   "jumps"

[[2]]
[1] "once"     "upon"     "midnight" "dreary"

[[3]]
[1] "quoth"     "raven"     "nevermore"

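What I want to do is create a set of bigrams from this list (the real list is much larger than the example). My code is below, and it works. The problem is that it is too slow. I know the loop in bigram_list() is what slows it down, but I can't figure out how to use lapply with a list. I also suspect that data.table and rbindlist() would help a lot, but I can't seem to get them to work. Any help will be met with a chorus of hallelujahs.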

bigram_list <- function(words, ng) {
    tokens <- character()

    words_length <- length(words)
    for (i in 1:words_length) {
        set <- words[[i]]          ##current vector of words
        end <- length(set)
        if (end > 2) {
            if (ng == 1) {
                nlist <- set[1:(end-1)]
            } else {
                nlist <- set[2:end]
            }
            tokens <- c(tokens,nlist) 
        }
    }
    return(tokens)
}

make_bigram <- function(words) {
    n1s <- character()
    n2s <- character()

    n1.set <- bigram_list(words, 1)
    n1s <- c(n1s, n1.set)

    n2.set <- bigram_list(words, 2)
    n2s <- c(n2s, n2.set)

    bigrams <- data.frame(n1=n1s, n2=n2s)
    return(bigrams)
}

Answer 1

The way I would do this in Python is:

zip(tokens[:-1],tokens[1:])

Two things are going on there: list slicing and zip.

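List slicing lets you cut out pieces of a list. Here we make two slices: tokens[:-1] selects everything except the last token, and tokens[1:] selects everything except the first. The two slices are therefore the same list offset by one position.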

zip combines two lists, pairing the parallel entries from each into tuples; mapply in R seems to serve a similar purpose for our needs.

In R, this seems to work:

> tokens <- list("I", "am", "a", "banana",".")
> mapply(list, tokens[-length(tokens)], tokens[-1])
     [,1] [,2] [,3]     [,4]    
[1,] "I"  "am" "a"      "banana"
[2,] "am" "a"  "banana" "."
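
The same slicing idea carries over to the list of word vectors in the question. What follows is a minimal base-R sketch, not the asker's code: the helper name make_bigram_fast is hypothetical, and it assumes every list element is a character vector. Note that it also keeps vectors of length 2 (which hold exactly one bigram), whereas the loop in the question only processes vectors longer than 2.

```r
# Hypothetical helper (not from the original post): builds the two
# bigram columns by slicing each word vector, with no explicit loop.
make_bigram_fast <- function(words) {
  # Drop vectors too short to contain a bigram
  words <- words[lengths(words) >= 2]
  # First word of each bigram: everything except the last token
  n1 <- unlist(lapply(words, function(w) w[-length(w)]))
  # Second word of each bigram: everything except the first token
  n2 <- unlist(lapply(words, function(w) w[-1]))
  data.frame(n1 = n1, n2 = n2, stringsAsFactors = FALSE)
}

words <- list(c("quick", "brown", "fox", "jumps"),
              c("once", "upon", "midnight", "dreary"),
              c("quoth", "raven", "nevermore"))
bigrams <- make_bigram_fast(words)
```

Each lapply() call builds all of its pieces first and unlist() joins them once, so the result vector is not regrown with c() on every iteration.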

Answer 2

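To get bigrams you can use the ngram package. It is very fast because it is written in C. However, the current version seems to output only the n-grams themselves, without their frequencies.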

To get frequencies you can use the tau package. Among other things, textcnt() can return n-grams together with their frequencies.

In the benchmark below we can see that ngram() is much faster than textcnt(), but the comparison is not entirely fair, because textcnt() also counts frequencies.

text <- list("This is just an example for Stackoverflow",
             "The quick brown fox jumps over the lazy dog",
             "And another example sentence to test the function",
             "This is the end of the list")
# Repeat to make the list larger
text <- rep(text, 1000)

### Using the tau package --------------------------------
library(tau)
# Find and count n-grams. Treats list elements separately.
# (For example there is no bigram "stackoverflow the")
ngrams <- textcnt(text, method = "string", n = 2L, decreasing = TRUE)
# > head(ngrams, 3)
# this is      an example     and another
# 2000            1000            1000

## Apply over separate list elements
ngrams <- lapply(text,
                 function(x) textcnt(x,
                                     method = "string",
                                     n = 2L,
                                     decreasing = TRUE))

### Using the ngram package -------------------------------
# This works only on a string, not a list. 
# Does not return frequencies
library(ngram)
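# Note: collapsing the list joins the sentences, so bigrams that span
# two list elements (e.g. "Stackoverflow The") are included here.
# Distinct variable names avoid shadowing the ngram() function.
ng <- ngram(paste(unlist(text), collapse = " "),
            n = 2L)
bigrams <- get.ngrams(ng)
# > head(bigrams, 3)
# [1] "just an"     "The quick"   "is the"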

### Benchmarks ---------------------------------------------
library(microbenchmark)
microbenchmark(ngram(paste(unlist(text), collapse = " "), 
                              n = 2L))
# mean 34.3, median 33.7
microbenchmark(textcnt(text, method = "string", n = 2L, decreasing = TRUE))
# mean 223.7, median 220.6
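
For comparison, bigram frequencies can also be counted in base R without either package. This is only a rough sketch under stated assumptions: the helper name count_bigrams is hypothetical, tokens are split on whitespace alone, and the text is lowercased to resemble the textcnt() output shown above. It was not part of the benchmark, so no timing is claimed for it.

```r
text <- list("This is just an example for Stackoverflow",
             "This is the end of the list")

# Hypothetical helper (not from the original answer): counts bigrams
# within each list element, so no bigram spans two elements.
count_bigrams <- function(text) {
  # Lowercase and split each element on runs of whitespace
  tokens <- strsplit(tolower(unlist(text)), "[[:space:]]+")
  # Drop elements too short to contain a bigram
  tokens <- tokens[lengths(tokens) >= 2]
  # Pair each token with its successor and paste them together
  bigrams <- unlist(lapply(tokens,
                           function(w) paste(w[-length(w)], w[-1])))
  # Tabulate and sort by descending frequency
  sort(table(bigrams), decreasing = TRUE)
}

freq <- count_bigrams(text)
```

Indexing the result by name, for example freq["this is"], returns the count for that bigram.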

Need to speed up a function that creates bigrams from a list of texts
