I would like to know how the results of different packages, and hence algorithms, differ, and whether parameters can be set in a way that produces topics with similar meaning. I had a closer look at the packages text2vec and topicmodels in particular.

I used the code below to compare 10 topics generated with these packages (see the code section for the term lists). I did not manage to generate sets of topics with similar meaning. E.g., topic 10 from text2vec has something to do with "police", whereas none of the topics generated by topicmodels refers to "police" or similar terms. Furthermore, I could not identify a counterpart of topic 5 produced by topicmodels, with its link to "life-love-family-war", among the topics generated by text2vec.

I am a beginner with LDA, hence, my understanding might sound naive to experienced programmers. However, intuitively one would assume that it should be possible to generate sets of topics with similar meaning in order to demonstrate the validity/robustness of the results. Of course, not necessarily exactly the same set of terms, but term lists addressing similar topics.

Maybe the issue is simply that my human interpretation of these term lists fails to capture the similarities, but maybe there are parameters that would increase the similarity as perceived by a human. Can someone guide me on how to set the parameters to achieve this, or otherwise provide explanations or pointers to suitable resources that improve my understanding of the matter?
Here are some points that might be relevant:

- text2vec does not use standard Gibbs sampling but WarpLDA, so there is already a difference in the algorithm compared to topicmodels.
- If my understanding is correct, the priors alpha and delta used in topicmodels are set via doc_topic_prior and topic_word_prior, respectively, in text2vec (a small sketch of this mapping follows after this list).
- In text2vec, lambda controls how the top terms of a topic are ranked. I have not yet understood how the terms in topicmodels are ordered - is that comparable to setting lambda = 1? (I have tried different lambdas between 0 and 1 without getting similar topics.)
- It also seems difficult to generate fully reproducible examples, including setting the seed (see, e.g., this question). This is not my question as such, but it might make answering more difficult.

Sorry for the longish question, and thanks in advance for any help or suggestions.
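To make the assumed mapping of the priors explicit, here is a minimal sketch of how I line up the two interfaces (the correspondence is my reading of the two packages; dtm_triplet is just a placeholder for a document-term matrix in simple triplet format, as built further below with as.simple_triplet_matrix(dtm)):

#assumed correspondence of the priors (my reading, see the list above):
#  topicmodels: alpha (document-topic prior), delta (topic-word prior)
#  text2vec:    doc_topic_prior,              topic_word_prior
alphaprior <- 0.1
deltaprior <- 0.001
lda_t2v <- text2vec::LDA$new(n_topics = 10
                             ,doc_topic_prior = alphaprior  #corresponds to alpha
                             ,topic_word_prior = deltaprior #corresponds to delta
                             )
lda_tm <- topicmodels::LDA(dtm_triplet, k = 10, method = "Gibbs",
                           control = list(alpha = alphaprior
                                          ,delta = deltaprior
                                          )
                           )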
Update 2: I have moved the content of my first update into an answer based on a more complete analysis.

Update: Following the helpful comments of Dmitriy Selivanov, the creator of the text2vec package, I can confirm that setting lambda = 1 increases the similarity between the term lists of the topics generated by the two packages.
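For reference, the only change relative to the code below is the lambda argument of get_top_words; as far as I understand, lambda = 1 ranks the terms purely by their probability within the topic:

#rank terms by their within-topic probability only (lambda = 1)
#instead of the lift-weighted ranking used with lambda = 0.3 below
lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 1)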
Furthermore, I had a closer look at the differences between the term lists produced by the two packages by quickly checking length(setdiff()) and length(intersect()) across the topics (see the code below). This rough check shows that text2vec discards several terms per topic - presumably based on a probability threshold for the individual topics? topicmodels keeps all terms for all topics. This explains part of the differences in the meaning that can (by a human) be derived from the term lists.

As mentioned above, generating a reproducible example seems difficult, so I have not adapted all of the data in the code below. Since run times are short, anybody can check the results on his/her own system.
library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
ntopics <- 10
alphaprior <- 0.1   #document-topic prior (alpha in topicmodels, doc_topic_prior in text2vec)
deltaprior <- 0.001 #topic-word prior (delta in topicmodels, topic_word_prior in text2vec)
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels
#Generate document term matrix with text2vec
tokens = movie_review$review[1:1000] %>%
tolower %>%
word_tokenizer
it = itoken(tokens, ids = movie_review$id[1:1000], progressbar = FALSE)
vocab = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
#LDA model with text2vec
lda_model = text2vec::LDA$new(n_topics = ntopics
                              ,doc_topic_prior = alphaprior
                              ,topic_word_prior = deltaprior
                              )
doc_topic_distr = lda_model$fit_transform(x = dtm
                                          ,n_iter = niter
                                          ,convergence_tol = convtol
                                          ,n_check_convergence = 25
                                          ,progressbar = FALSE
                                          )
#LDA model with topicmodels
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm), k = ntopics, method = "Gibbs",
                      control = list(burnin = 100
                                     ,delta = deltaprior
                                     ,alpha = alphaprior
                                     ,iter = niter
                                     ,keep = 50
                                     ,tol = convtol
                                     ,seed = seedpar
                                     ,initialize = "seeded"
                                     )
                      )
#show top 15 words
lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 0.3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "finally" "men" "know" "video" "10" "king" "five" "our" "child" "cop"
# [2,] "re" "always" "ve" "1" "doesn" "match" "atmosphere" "husband" "later" "themselves"
# [3,] "three" "lost" "got" "head" "zombie" "lee" "mr" "comedy" "parents" "mary"
# [4,] "m" "team" "say" "girls" "message" "song" "de" "seem" "sexual" "average"
# [5,] "gay" "here" "d" "camera" "start" "musical" "may" "man" "murder" "scenes"
# [6,] "kids" "within" "funny" "kill" "3" "four" "especially" "problem" "tale" "police"
# [7,] "sort" "score" "want" "stupid" "zombies" "dance" "quality" "friends" "television" "appears"
# [8,] "few" "thriller" "movies" "talking" "movies" "action" "public" "given" "okay" "trying"
# [9,] "bit" "surprise" "let" "hard" "ask" "fun" "events" "crime" "cover" "waiting"
# [10,] "hot" "own" "thinking" "horrible" "won" "tony" "u" "special" "stan" "lewis"
# [11,] "die" "political" "nice" "stay" "open" "twist" "kelly" "through" "uses" "imdb"
# [12,] "credits" "success" "never" "back" "davis" "killer" "novel" "world" "order" "candy"
# [13,] "two" "does" "bunch" "didn" "completely" "ending" "copy" "show" "strange" "name"
# [14,] "otherwise" "beauty" "hilarious" "room" "love" "dancing" "japanese" "new" "female" "low"
# [15,] "need" "brilliant" "lot" "minutes" "away" "convincing" "far" "mostly" "girl" "killing"
terms(ldatopicmodels, 15)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] "show" "where" "horror" "did" "life" "such" "m" "films" "man" "seen"
# [2,] "years" "minutes" "pretty" "10" "young" "character" "something" "music" "new" "movies"
# [3,] "old" "gets" "best" "now" "through" "while" "re" "actors" "two" "plot"
# [4,] "every" "guy" "ending" "why" "love" "those" "going" "role" "though" "better"
# [5,] "series" "another" "bit" "saw" "woman" "does" "things" "performance" "big" "worst"
# [6,] "funny" "around" "quite" "didn" "us" "seems" "want" "between" "back" "interesting"
# [7,] "comedy" "nothing" "little" "say" "real" "book" "thing" "love" "action" "your"
# [8,] "again" "down" "actually" "thought" "our" "may" "know" "play" "shot" "money"
# [9,] "tv" "take" "house" "still" "war" "work" "ve" "line" "together" "hard"
# [10,] "watching" "these" "however" "end" "father" "far" "here" "actor" "against" "poor"
# [11,] "cast" "fun" "cast" "got" "find" "scenes" "doesn" "star" "title" "least"
# [12,] "long" "night" "entertaining" "2" "human" "both" "look" "never" "go" "say"
# [13,] "through" "scene" "must" "am" "shows" "yet" "isn" "played" "city" "director"
# [14,] "once" "back" "each" "done" "family" "audience" "anything" "hollywood" "came" "probably"
# [15,] "watched" "dead" "makes" "3" "mother" "almost" "enough" "always" "match" "video"
#UPDATE
#number of terms in each model is the same
length(ldatopicmodels@terms)
# [1] 2170
nrow(vocab)
# [1] 2170
#number of NA entries for termlist of first topic differs
sum(is.na(
lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[,1]
)
)
#[1] 1778
sum(is.na(
terms(ldatopicmodels, length(ldatopicmodels@terms))
)
)
#[1] 0
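The same check can be run for all topics at once, e.g.:

#number of non-NA terms per text2vec topic (the remaining terms are dropped by the package)
colSums(!is.na(
  lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)
))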
#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {
  apply(x, 2, function(i) {
    apply(y, 2, function(j) {
      length(setdiff(i[!is.na(i)], j[!is.na(j)]))
    })
  })
}
#apply the check
termstopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))
termstext2vec <- lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)
lengthsetdiff(termstopicmodels,
termstopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
lengthsetdiff(termstext2vec,
termstext2vec)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 340 318 335 292 309 320 355 294 322
# [2,] 355 0 321 343 292 319 311 346 302 339
# [3,] 350 338 0 316 286 309 311 358 318 322
# [4,] 346 339 295 0 297 310 301 335 309 332
# [5,] 345 330 307 339 0 310 310 354 309 333
# [6,] 350 345 318 340 298 0 311 342 308 325
# [7,] 366 342 325 336 303 316 0 364 311 325
# [8,] 355 331 326 324 301 301 318 0 311 335
# [9,] 336 329 328 340 298 309 307 353 0 314
# [10,] 342 344 310 341 300 304 299 355 292 0
lengthsetdiff(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [2,] 1793 1793 1793 1793 1793 1793 1793 1793 1793 1793
# [3,] 1810 1810 1810 1810 1810 1810 1810 1810 1810 1810
# [4,] 1789 1789 1789 1789 1789 1789 1789 1789 1789 1789
# [5,] 1831 1831 1831 1831 1831 1831 1831 1831 1831 1831
# [6,] 1819 1819 1819 1819 1819 1819 1819 1819 1819 1819
# [7,] 1824 1824 1824 1824 1824 1824 1824 1824 1824 1824
# [8,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [9,] 1820 1820 1820 1820 1820 1820 1820 1820 1820 1820
# [10,] 1798 1798 1798 1798 1798 1798 1798 1798 1798 1798
lengthsetdiff(termstext2vec,
termstopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {
  apply(x, 2, function(i) {
    apply(y, 2, function(j) {
      length(intersect(i[!is.na(i)], j[!is.na(j)]))
    })
  })
}
lengthintersect(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 392 392 392 392 392 392 392 392 392 392
# [2,] 377 377 377 377 377 377 377 377 377 377
# [3,] 360 360 360 360 360 360 360 360 360 360
# [4,] 381 381 381 381 381 381 381 381 381 381
# [5,] 339 339 339 339 339 339 339 339 339 339
# [6,] 351 351 351 351 351 351 351 351 351 351
# [7,] 346 346 346 346 346 346 346 346 346 346
# [8,] 392 392 392 392 392 392 392 392 392 392
# [9,] 350 350 350 350 350 350 350 350 350 350
# [10,] 372 372 372 372 372 372 372 372 372 372
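Building on these intersection counts, the overlap matrix can also be used to match each topicmodels topic with the text2vec topic it shares the most terms with. A minimal sketch, restricting the comparison to the top terms of each topic (the choice of the top 50 terms is only for illustration):

#restrict the comparison to the top 50 terms per topic (illustrative choice)
top_n <- 50
top_tm  <- terms(ldatopicmodels, top_n)
top_t2v <- lda_model$get_top_words(n = top_n, topic_number = c(1:10), lambda = 1)
#rows: text2vec topics, columns: topicmodels topics (see lengthintersect above)
overlap <- lengthintersect(top_tm, top_t2v)
#for each topicmodels topic, the text2vec topic with the largest term overlap
apply(overlap, 2, which.max)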