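I'm trying to replicate the method from this FiveThirtyEight article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to find each candidate's most repetitive phrases.
I'm trying to implement this method on another dataset in R with the tm package.
Most of the code (GitHub repository) deals with mining the transcripts and tallying counts for each ngram, but I get lost in the code for the prune_substrings() function below: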
def prune_substrings(tfidf_dicts, prune_thru=1000):
    pruned = tfidf_dicts
    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []
        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])
                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]
    return pruned
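The function's input, tfidf_dicts, is an array of dictionaries (one per candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dictionary starts like this:
trump.tfidf.dict = {"we don't win": 83.2, "you have to": 72.8, ... }
So the input is structured like this:
tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }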
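As a minimal sketch of that input in runnable Python: the list below holds one dict per candidate, and the sorted(...) call is the same one prune_substrings uses to rank each candidate's ngrams. Everything beyond the two Trump scores quoted above is made up for illustration, and the use of word tuples as keys is an assumption (the function calls list(ngram[0]), which suggests the keys are sequences of words rather than plain strings).

```python
import operator

# Hypothetical data: one dict per candidate, ngram (tuple of words) -> tf-idf score.
# Only the first two Trump scores come from the question; the rest are invented.
trump_tfidf = {
    ("we", "don't", "win"): 83.2,
    ("you", "have", "to"): 72.8,
    ("have", "to"): 10.0,
}
rubio_tfidf = {("this", "president"): 50.0, ("21st", "century"): 64.1}
tfidf_dicts = [trump_tfidf, rubio_tfidf]


def top_ngrams(tfidf_dict, prune_thru=1000):
    """Return (ngram, score) pairs, highest score first, cut off at prune_thru."""
    ranked = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:prune_thru]


for scores in tfidf_dicts:
    print(top_ngrams(scores, prune_thru=2))
```

Because of reverse=True, each candidate's highest-scoring ngram comes first, so the "better" phrases are the ones visited earliest.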
My understanding is that prune_substrings does the following, but I'm stuck on the else if clause, which is a pythonic thing I don't understand.
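A. Create list: pruned as tfidf_dicts; a list of tfidf dicts, one per candidate
B. Loop through each candidate:
   - so_far = start an empty list of the ngrams gone through so far
   - ngrams_sorted = sorted members of that candidate's tf-idf dict, from smallest to largest
   - loop through each ngram in ngrams_sorted
     - loop through each better_ngram in so_far
       - IF overlap b/w (below) == TRUE:
         - better_ngram (from so_far) and
         - ngram (from ngrams_sorted)
         - THEN zero out the tf-idf for ngram
       - ELSE if (WHAT?!?)
         - add ngram to the list, so_far
C. return pruned, i.e. the list of unique ngrams, sorted in order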
Any help is much appreciated!
Answer 0 (score: 1)
Note the indentation in the code: the else is aligned with the second for, not with the if. This is a for-else construct, not an if-else.
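In this case, the else serves to seed the inner loop: it executes the first time, when so_far is empty, and again each time the inner loop runs out of items to iterate over. Because the inner loop contains no break, that means the else runs on every pass of the outer loop.
I'm not sure this is the most efficient way to implement these comparisons, but conceptually you can follow the flow with this snippet: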
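s = []
for j in "ABCD":
    for i in s:
        # print the items collected so far on one line
        print(i, end=" ")
    else:
        # runs whenever the inner loop finishes, including when s is empty
        print("\nelse")
        s.append(j)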
Output:
else
A
else
A B
else
A B C
else
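For contrast, here is a small sketch of how for-else behaves with and without break. The else branch is skipped only when the loop exits through break; prune_substrings has no break, so its else runs on every pass. The function names and sample data are made up for illustration.

```python
def scan_without_break(items, target):
    """for-else with no break: the else block always runs."""
    log = []
    for item in items:
        if item == target:
            log.append("found")
    else:
        log.append("else ran")
    return log


def scan_with_break(items, target):
    """for-else with break: the else block runs only if break was never hit."""
    log = []
    for item in items:
        if item == target:
            log.append("found")
            break
    else:
        log.append("else ran")
    return log


print(scan_without_break("ABC", "B"))  # ['found', 'else ran']
print(scan_with_break("ABC", "B"))     # ['found']
print(scan_with_break("ABC", "Z"))     # ['else ran']
print(scan_with_break("", "Z"))        # ['else ran'] (an empty loop still reaches else)
```

So, as written, prune_substrings appends every ngram to so_far, whether or not its score was just set to zero.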
I think there is a better way to do this in R than nested loops....
Answer 1 (score: 1)
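- Take the first n ngrams.
- Create a list t, where each element of the list is a logical vector of length n indicating whether the ngram in question overlaps with each of the other ngrams (but fix 1:x to be FALSE automatically).
- Combine every element of t into a table t2.
- Return only the elements whose t2 row sum is zero.
- Set elements 1:n to FALSE (i.e. no overlap).
Voilà!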
library(dplyr)    # %>% and select()
library(stringr)  # word(), str_detect(), coll()

#' GetPrunedList
#'
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
#' (the df is assumed to be sorted by score, highest first)
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
  # take only first n items in list
  tmp <- head(wordfreqdf, n = prune_thru) %>%
    select(ngrams = Words, tfidfXlength = LenNorm)
  # for each ngram in list:
  t <- (lapply(1:nrow(tmp), function(x) {
    # find overlap between ngram and all items in list (overlap = TRUE)
    idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
    # set overlap as false for itself and higher-scoring ngrams
    idx[1:x] <- FALSE
    idx
  }))
  # bind each ngram's overlap vector together to make a matrix
  t2 <- do.call(cbind, t)
  # find rows (i.e. ngrams) that do not overlap with any higher-scoring ngram
  idx <- rowSums(t2) == 0
  pruned <- tmp[idx, ]
  rownames(pruned) <- NULL
  pruned
}

#' overlap
#' OBJ: takes two ngrams (as strings) and checks whether they overlap
#' INPUT: a,b ngrams as strings
#' OUTPUT: TRUE if overlap
overlap <- function(a, b) {
  # CountWords() is not defined in this answer; it is assumed to be a helper
  # that returns the number of words in a string
  max_overlap <- min(3, CountWords(a), CountWords(b))
  a.beg <- word(a, start = 1L, end = max_overlap)
  a.end <- word(a, start = -max_overlap, end = -1L)
  b.beg <- word(b, start = 1L, end = max_overlap)
  b.end <- word(b, start = -max_overlap, end = -1L)
  # b contains a's beginning (TRUE = ignore case)
  w <- str_detect(b, coll(a.beg, TRUE))
  # b contains a's end
  x <- str_detect(b, coll(a.end, TRUE))
  # a contains b's beginning
  y <- str_detect(a, coll(b.beg, TRUE))
  # a contains b's end
  z <- str_detect(a, coll(b.end, TRUE))
  # return TRUE if any of above are true
  (w | x | y | z)
}
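For readers following along in Python, the same matrix idea can be sketched without stringr. This is an illustration under assumptions, not a port of the R code: the ngrams are assumed to arrive sorted by score, highest first, and the overlap test here is a simplified stand-in (one phrase's word sequence contained in the other's) rather than the three-word beginning/end comparison above. All names and sample phrases are hypothetical.

```python
def contains(longer, shorter):
    """True if the word list `shorter` appears contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def overlap(a, b):
    """Simplified stand-in: two ngrams overlap if either contains the other."""
    wa, wb = a.split(), b.split()
    return contains(wa, wb) or contains(wb, wa)


def get_pruned_list(ngrams, prune_thru=100):
    """Keep ngrams that overlap no higher-ranked ngram (input sorted best first)."""
    top = ngrams[:prune_thru]
    size = len(top)
    # matrix[x][i] is True when ngram i ranks below ngram x and overlaps it
    matrix = [[i > x and overlap(top[x], top[i]) for i in range(size)]
              for x in range(size)]
    # an ngram survives when no higher-ranked ngram flags it
    return [top[i] for i in range(size) if not any(row[i] for row in matrix)]


phrases = ["you have to", "have to", "we don't win", "don't win", "make america"]
print(get_pruned_list(phrases))
```

With this sample list, "have to" and "don't win" are dropped because each is contained in a higher-ranked phrase.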