我正在进行投诉数据分析,在此过程中,我将采用文本摘要技术来减少不必要的文本尺寸并仅显示有用的文本。
我已经使用LDA-R中的潜在Dirichlet分配进行文本摘要,但是我无法充分发挥它的潜力。
对于最初的10行数据集,我无法以较低的精度获取摘要,但无法将其应用于所有数据集
library(igraph)
library(iterators)
#create a TCM using skip grams, we'll use a 5-word window
tcm <- CreateTcm(doc_vec = datacopy$Text,skipgram_window = 10,
verbose = FALSE,cpus = 2)
# LDA to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
burnin = 180, alpha = 0.1,beta = 0.05, optimize_alpha = TRUE,
calc_likelihood = FALSE,calc_coherence = FALSE, calc_r2 = FALSE,cpus=2)
#Summarizer function
summarizer <- function(doc, gamma) {
# handle multiple docs at once
if (length(doc) > 1 )
return(sapply(doc, function(d) try(summarizer(d, gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer words
#e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result<-sent[names(ev$vector)[order(ev$vector,decreasing=TRUE)[1:3]]]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs,gamma = embeddings$gamma)
sums
错误-
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix))
{:argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix))
{:argument is of length zero
实际文本: 亲爱的杰德,南华克市政厅,请找到我今天从阿斯顿收到的来自ABC Ltd的短信,其中明确指出,市政厅有责任处理井盖松动。 您能提供理事会采取的后续步骤的最新信息吗?
**随后是Trail Mails文本-大约50行文本**
摘要文本:“请找到我今天从美国广播公司(ABC)从阿斯顿(Aston)收到的附件消息,其中明确指出,理事会有责任处理井盖松动。我已阅读了邮件主题,请与ABC提供的电话号码联系;请致电我们的24小时客户服务中心,电话为0800 316 *** /; 0800 ****。由于TFL仍需要与我联系,而市议会声称这是泰晤士水务所需要的处理它(!!!!!)。”