Understanding another's text-mining function that removes similar strings

Time: 2016-05-03 06:28:58

Tags: python r dictionary nlp tm

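I'm trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine each candidate's most repetitive phrases.

I'm trying to implement this methodology on another dataset, in R with the tm package.

Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function below: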

def prune_substrings(tfidf_dicts, prune_thru=1000):

    pruned = tfidf_dicts

    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []

        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])

                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]

    return pruned 

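The function's input, tfidf_dicts, is an array of dictionaries (one per candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this: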

trump.tfidf.dict = {"we don't win": 83.2, 'you have to': 72.8, ... }

So the input is structured like this:

tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }
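
In runnable Python, that structure might look like the sketch below. The tuple-of-words keys are an assumption (suggested by the function calling list(ngram[0])), the variable names are illustrative, and the second dict is a made-up placeholder; only the two Trump scores come from the example above.

```python
import operator

# Assumption: each ngram key is a tuple of words.
# The two scores below are the ones quoted above; everything else is illustrative.
trump_tfidf = {("we", "don't", "win"): 83.2, ("you", "have", "to"): 72.8}
rubio_tfidf = {("placeholder", "phrase"): 1.0}  # made-up placeholder entry

# One dict per candidate, indexed by position.
tfidf_dicts = [trump_tfidf, rubio_tfidf]

# The same sort that prune_substrings performs: highest score first.
ngrams_sorted = sorted(tfidf_dicts[0].items(),
                       key=operator.itemgetter(1), reverse=True)
print(ngrams_sorted[0])
```

Each element of ngrams_sorted is a (ngram, score) pair, which is why the function refers to the ngram itself as ngram[0].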

My understanding is that prune_substrings does the following, but I'm stuck on the else if clause, which is a pythonic thing I don't understand yet.

  

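A. Create list: pruned as tfidf_dicts; a list of tf-idf dicts, one per candidate

B. Loop through each candidate:

  1. so_far = start an empty list of the ngrams gone through so far
  2. ngrams_sorted = sorted members of the candidate's tf-idf dict, from largest to smallest score
  3. Loop through each ngram in ngrams_sorted
    • Loop through each better_ngram in so_far
      1. IF overlap between
        • better_ngram (from so_far) and
        • ngram (from ngrams_sorted)
        is TRUE, then zero out the tf-idf for ngram
      2. ELSE if (WHAT?!?)
        • add ngram to the list so_far

C. Return pruned, i.e. a list of unique ngrams sorted in order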

Any help is greatly appreciated!

2 Answers:

Answer 0: (score: 1)

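Note the indentation in the code... the else lines up with the second for, not with the if. This is a for-else construct, not an if-else.

In this case the else is used to initialize the inner loop: it executes the first time through, when so_far is empty, and again every time the inner loop runs out of items to iterate over. More generally, the else of a for loop runs whenever the loop ends without hitting a break, and this inner loop contains no break.

I'm not sure this is the most efficient way to implement these comparisons, but conceptually you can follow the flow with this snippet: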

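s = []
for j in "ABCD":
    for i in s:
        # print the items collected so far on one line (Python 3 syntax)
        print(i, end=" ")
    else:
        # no break above, so this runs after every pass of the inner loop
        print("\nelse")
        s.append(j)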

Output:

else
A 
else
A B 
else
A B C 
else

I think there is a better way to do this in R than nested loops....
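
To make the role of break concrete, here is a small Python 3 sketch; the function names are made up for illustration. The first function shows the usual for-else pattern, where break skips the else. The second mimics the inner loop of prune_substrings, which has no break, so its else runs for every item, and every ngram ends up in so_far whether or not its score was zeroed.

```python
def first_match(items, target):
    """Report whether the else branch of a for-else ran."""
    for item in items:
        if item == target:
            result = "found, else skipped"
            break  # leaving the loop via break skips the else clause
    else:
        # Runs when the loop ended without break, including for empty input.
        result = "not found, else ran"
    return result


def collect_without_break(items):
    """Mimic the inner loop of prune_substrings: no break, so else always runs."""
    so_far = []
    for item in items:
        for earlier in so_far:
            pass  # the overlap comparisons would go here
        else:
            so_far.append(item)
    return so_far


print(first_match(["a", "b"], "b"))
print(first_match(["a", "b"], "z"))
print(first_match([], "z"))
print(collect_without_break("ABCD"))
```

So in the quoted function the else is effectively just "the statement after the inner loop"; dedenting so_far += [list(ngram[0])] by one level and dropping the else would behave the same way.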

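Answer 1: (score: 1)

Four months later, here is my solution. I'm sure there is a more efficient one, but for my purposes it works. The pythonic for-else doesn't translate to R, so the steps are different.

  1. Take the top n ngrams.
  2. Create a list t in which each element is a logical vector of length n, saying whether the ngram in question overlaps each of the other ngrams (with elements 1:x automatically set to FALSE).
  3. cbind the elements of t together into a table, t2.
  4. Return only the ngrams whose row sum in t2 is zero, i.e. those whose entries are all FALSE (no overlap).
  5. Voilà!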

    PrunedList Function

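    # dplyr supplies %>% and select(); stringr supplies word(), str_detect() and coll()
    library(dplyr)
    library(stringr)

    #' GetPrunedList
    #' 
    #' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
    #' (the df is assumed to be sorted with the best-scoring ngrams first)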
    GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
            #take only first n items in list
            tmp <- head(wordfreqdf, n = prune_thru) %>%
                    select(ngrams = Words, tfidfXlength = LenNorm)
            #for each ngram in list:
            t <- (lapply(1:nrow(tmp), function(x) {
                    #find overlap between ngram and all items in list (overlap = TRUE)
                    idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
                    #set overlap as false for itself and higher-scoring ngrams
                    idx[1:x] <- FALSE
                    idx
            }))
    
            #bind each ngram's overlap vector together to make a matrix
            t2 <- do.call(cbind, t)   
    
            #keep rows (i.e. ngrams) that do not overlap any higher-scoring ngram
            idx <- rowSums(t2) == 0
            pruned <- tmp[idx,]
            rownames(pruned) <- NULL
            pruned
    }
    

    Overlap function

    #' overlap
    #' OBJ: takes two ngrams (as strings) and checks whether they overlap
    #' INPUT: a, b ngrams as strings
    #' OUTPUT: TRUE if overlap
    #' NOTE: CountWords() is a helper that is not shown in this answer
    overlap <- function(a, b) {
            max_overlap <- min(3, CountWords(a), CountWords(b))
    
            a.beg <- word(a, start = 1L, end = max_overlap)
            a.end <- word(a, start = -max_overlap, end = -1L)
            b.beg <- word(b, start = 1L, end = max_overlap)
            b.end <- word(b, start = -max_overlap, end = -1L)
    
            # b contains a's beginning
            w <- str_detect(b, coll(a.beg, TRUE))
            # b contains a's end
            x <- str_detect(b, coll(a.end, TRUE))
            # a contains b's beginning
            y <- str_detect(a, coll(b.beg, TRUE))
            # a contains b's end
            z <- str_detect(a, coll(b.end, TRUE))
    
            #return TRUE if any of above are true
            (w | x | y | z)
    }
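
For readers coming from the Python side, the same pruning rule can be sketched in Python without for-else. This is only a rough illustration: prune and this overlap are hypothetical names, not the code from the GitHub repository, and overlap is a pairwise, lower-cased approximation of the R function above (it assumes ngrams are space-separated strings).

```python
def overlap(a, b, max_words=3):
    """Approximate port of the R overlap(): compare up to max_words words
    from the start and end of each ngram against the other ngram."""
    a, b = a.lower(), b.lower()  # stands in for the case-insensitive coll()
    a_words, b_words = a.split(), b.split()
    k = min(max_words, len(a_words), len(b_words))
    if k == 0:
        return False  # an empty ngram overlaps nothing
    a_beg, a_end = " ".join(a_words[:k]), " ".join(a_words[-k:])
    b_beg, b_end = " ".join(b_words[:k]), " ".join(b_words[-k:])
    # Plain substring tests, like str_detect() in the R version.
    return a_beg in b or a_end in b or b_beg in a or b_end in a


def prune(ranked_ngrams):
    """Keep an ngram only if it overlaps no higher-ranked ngram.
    ranked_ngrams must be ordered best score first."""
    kept = []
    for i, ngram in enumerate(ranked_ngrams):
        better = ranked_ngrams[:i]  # every higher-ranked ngram, pruned or not
        if not any(overlap(ngram, other) for other in better):
            kept.append(ngram)
    return kept


ranked = ["we don't win", "don't win", "you have to", "have to"]
print(prune(ranked))
```

As in the R version, each ngram is compared against every higher-ranked ngram, including ones that were themselves pruned, which matches the Python original where so_far receives every ngram.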