
Time: 2019-03-04 16:27:24

Tags: r performance nlp


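I need to a) split every string into bigrams, b) compare each set of bigrams from one dataset against each set of bigrams from the other dataset (a set being the bigrams from a single element of the original character vector), and then c) output a measure of bigram overlap.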
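To make the overlap measure concrete, here is a minimal base-R sketch for a single pair of strings. The helper `bigrams()` and the two example sentences are assumptions made for illustration; the helper splits on whitespace only, which is simpler than the tidytext tokenizer used in the function below.

```r
# Hypothetical helper: the set of distinct word bigrams in one string.
bigrams <- function(x) {
  words <- strsplit(tolower(x), "\\s+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens from leading spaces
  if (length(words) < 2) return(character(0))
  unique(paste(head(words, -1), tail(words, -1)))
}

sys_set <- bigrams("the cat sat on the mat")
ref_set <- bigrams("the cat lay on the mat")

# Shared bigrams: "the cat", "on the", "the mat"
overlap   <- length(intersect(sys_set, ref_set))   # 3
precision <- overlap / length(sys_set)             # 3 / 5 = 0.6
recall    <- overlap / length(ref_set)             # 3 / 5 = 0.6
f_score   <- 2 * precision * recall / (precision + recall)
```

Precision divides the shared count by the size of the `sys` set, recall by the size of the `ref` set, which is the same arithmetic the nested loop performs for every pair.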


library(tibble)    # enframe()
library(dplyr)     # %>%, filter(), distinct()
library(tidytext)  # unnest_tokens()

rougenscore <- function(sys, ref, n = 2, metric = c("f-score","precision","recall")) {
  metric <- match.arg(metric)
  # Tokenize both vectors into n-grams once, keeping the element index as ID
  tidysys <- enframe(sys, name = "ID", value = "text") %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n)
  tidyref <- enframe(ref, name = "ID", value = "text") %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n)
  if (metric %in% c("f-score","precision")) {
    precision <- matrix(nrow = length(sys), ncol = length(ref), dimnames = list(sys, ref))
  }
  if (metric %in% c("f-score","recall")) {
    recall <- matrix(nrow = length(sys), ncol = length(ref), dimnames = list(sys, ref))
  }

  #Loops through all elements in sys and all elements in ref and compares them pairwise
  for (i in seq_along(sys)) {
    for (j in seq_along(ref)) {
      seti <- tidysys %>%
        filter(ID == i) %>%
        distinct(ngram) #We'll be using sets and set theory here, hence distinct
      setj <- tidyref %>%
        filter(ID == j) %>%
        distinct(ngram)
      if (metric %in% c("f-score","precision")) {
        precision[i,j] <- length(intersect(seti$ngram, setj$ngram))/length(seti$ngram)
      }
      if (metric %in% c("f-score","recall")) {
        recall[i,j] <- length(intersect(seti$ngram, setj$ngram))/length(setj$ngram)
      }
    }
  }
  #This switch case returns the requested metric (as a matrix)
  switch (metric,
          "f-score" = (2 * (precision * recall) / (precision + recall)),
          "recall" = recall,
          "precision" = precision
  )
}

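I have confirmed that it works, but the for loops are very slow, and I am trying to compare a vector of about 4,000 strings against a vector of about 6,000 strings. From what I recall of using ROUGE in other languages, this is simply a very time-consuming process, but I would like to speed it up as much as possible. I am not concerned about the bigram tokenization, since it is done only once up front, but I am wondering whether there is a better way than intersect() to count the number of identical elements in two character vectors. Or can the filter() / distinct() calls be sped up outside of dplyr?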
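One direction that might be worth trying, sketched here without any benchmarking: build the distinct bigram sets once, encode them as sparse 0/1 incidence matrices over a shared vocabulary, and obtain every pairwise intersection size with a single matrix product instead of calling filter(), distinct() and intersect() inside the loop. The helpers `bigram_sets()` and `incidence()` and the toy inputs are assumptions for illustration. The sketch uses the Matrix package and whitespace tokenization rather than tidytext.

```r
library(Matrix)

# Hypothetical helper: one set of distinct word bigrams per input string.
bigram_sets <- function(x) {
  lapply(strsplit(tolower(x), "\\s+"), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))
    unique(paste(head(w, -1), tail(w, -1)))
  })
}

# Hypothetical helper: rows are strings, columns are vocabulary bigrams,
# and an entry is 1 when that string contains that bigram.
incidence <- function(sets, vocab) {
  sparseMatrix(
    i = rep(seq_along(sets), lengths(sets)),
    j = match(unlist(sets), vocab),
    x = 1,
    dims = c(length(sets), length(vocab))
  )
}

sys <- c("the cat sat on the mat", "a dog barked")
ref <- c("the cat lay on the mat", "a dog barked loudly", "nothing shared here")

sys_sets <- bigram_sets(sys)
ref_sets <- bigram_sets(ref)
vocab    <- unique(c(unlist(sys_sets), unlist(ref_sets)))

S <- incidence(sys_sets, vocab)
R <- incidence(ref_sets, vocab)

# overlap[i, j] is the number of bigrams shared by sys[i] and ref[j]
overlap   <- as.matrix(tcrossprod(S, R))
precision <- overlap / lengths(sys_sets)                # divide row i by |sys set i|
recall    <- sweep(overlap, 2, lengths(ref_sets), "/")  # divide column j by |ref set j|
```

If the tidytext tokenization should be kept, the sets could presumably be taken from the existing tidy tables instead, for example with `split(tidysys$ngram, tidysys$ID)` followed by `unique()` on each element. Note that the dense result for 4,000 by 6,000 inputs has 24 million entries per metric.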


0 answers:
