Finding similar texts based on paraphrase detection

Asked: 2014-01-18 15:25:14

Tags: nlp nltk text-mining semantic-analysis

I'm interested in finding similar content (texts) based on paraphrasing. How would I go about doing this? Are there any specific tools that can do it? Preferably in Python.

3 Answers:

Answer 0 (score: 5)

I believe the tool you are looking for is Latent Semantic Analysis (LSA).

Given that my post would get really lengthy otherwise, I won't explain the theory behind it in detail; if you think it is indeed what you want, I suggest you look it up. A good place to start is:

http://staff.scm.uws.edu.au/~lapark/lt.pdf

In short, LSA tries to uncover the latent/hidden meaning of words and phrases, based on the assumption that similar words appear in similar documents. I'll use R to demonstrate how it works.
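At its core, LSA is a (truncated) singular value decomposition of a weighted term-document matrix. A toy sketch of that idea, with a made-up 3x3 matrix:

M = matrix(c(1,1,0, 1,0,1, 0,1,1), nrow=3) # 3 terms x 3 documents, made up
s = svd(M)                                 # M = U D V'
doc.space = s$v %*% diag(s$d)              # documents projected into the latent space
# rows of doc.space can now be compared with cosine similarity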

I'll set up a function that retrieves similar documents based on their latent meaning:

# Setting up all the needed functions:

SemanticLink = function(text,expression,LSAS,n=length(text),Out="Text"){ 

  # Query Vector
  LookupPhrase = function(phrase,LSAS){ 
    lsatm = as.textmatrix(LSAS) 
    QV = function(phrase){ 
      q = query(phrase,rownames(lsatm)) 
      t(q)%*%LSAS$tk%*%diag(LSAS$sk) 
    } 

    q = QV(phrase) 
    qd = 0 

    # cosine similarity between the query vector and every document in the LSA space
    for (i in 1:nrow(LSAS$dk)){ 
      qd[i] <- cosine(as.vector(q),as.vector(LSAS$dk[i,])) 
    }  
    qd  
  } 

  # Handling Synonyms
  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","", 
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","", 
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)  
  } 

  ex = unlist(strsplit(expression," "))
  for(i in seq(ex)){ex = c(ex,Syns(ex[i]))}
  ex = unique(wordStem(ex))

  cache = LookupPhrase(paste(ex,collapse=" "),LSAS) 

  if(Out=="Text"){return(text[which(match(cache,sort(cache,decreasing=T)[1:n])!="NA")])} 
  if(Out=="ValuesSorted"){return(sort(cache,decreasing=T)[1:n]) } 
  if(Out=="Index"){return(which(match(cache,sort(cache,decreasing=T)[1:n])!="NA"))} 
  if(Out=="ValuesUnsorted"){return(cache)} 

} 

Note that we make use of synonyms when assembling the query vector. This approach isn't perfect, because some of the synonyms in the qdap library are remotely related at best... This may interfere with your search query, so for more accurate but less generic results you could simply remove the synonym bit and hand-pick all the relevant terms that make up the query vector.
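If you do go the manual route, a minimal sketch (the terms below are hypothetical and hand-picked, not a recommendation):

manual.query = c("support","trade","tariff","export") # hypothetical hand-picked terms
q = paste(unique(wordStem(manual.query)),collapse=" ")
# `q` can then be fed to LookupPhrase() in place of the synonym-expanded expression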

Let's try it out. I'll also be using the US Congress dataset from the RTextTools package:
library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)

data(USCongress)

text = as.character(USCongress$text)

corp = Corpus(VectorSource(text)) 

parameters = list(minDocFreq        = 1, 
                  wordLengths       = c(2,Inf), 
                  tolower           = TRUE, 
                  stripWhitespace   = TRUE, 
                  removeNumbers     = TRUE, 
                  removePunctuation = TRUE, 
                  stemming          = TRUE, 
                  stopwords         = TRUE, 
                  tokenize          = NULL, 
                  weighting         = function(x) weightSMART(x,spec="ltn"))

tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)

# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced) 
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
lsa.tm = as.textmatrix(lsaSpace)
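# a hedged alternative to dimcalc_raw() above: dimcalc_share() (also from the
# lsa package) keeps only the dimensions accounting for a given share of the
# singular values, e.g.:
# lsaSpace = lsa(td.mat.lsa,dims=dimcalc_share(share=0.8))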

l = 50 
exp = "support trade" 
SemanticLink(text,exp,n=5,lsaSpace,Out="Text") 

[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."                                                                       
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."           
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."    
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes." 

As you can see, even though "support trade" may not appear verbatim in the results above, the function has retrieved a set of documents relevant to the query. The function is designed to retrieve documents with semantic links rather than exact matches.

We can also see how "close" these documents are to the query vector by plotting the cosine distances:

plot(1:l,SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesSorted") 
     ,type="b",pch=16,col="blue",main=paste("Query Vector Proximity",exp,sep=" "), 
     xlab="observations",ylab="Cosine") 

I don't yet have enough reputation to include the plot, sorry.

As you can see, the first two entries appear to be more related to the query vector than the rest (although about 5 of them are particularly relevant), even though on reading them you wouldn't necessarily think so. I would say this is the effect of using synonyms to build the query vector. Ignoring that, however, the graph lets us see how many other documents are at least remotely similar to the query vector.
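Rather than eyeballing the plot, one way to cut the list off is a cosine threshold; a short sketch (the 0.5 cutoff is an arbitrary assumption, tune it to your corpus):

sims = SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesUnsorted")
cutoff = 0.5                          # arbitrary cutoff, an assumption
relevant = text[which(sims > cutoff)] # documents clearing the threshold
length(relevant)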


EDIT:


Just recently I had to solve the very problem you're trying to solve, but the above function didn't work well, simply because the data was awful: the texts were short, there were very few of them, and they didn't cover many topics. So to find relevant entries in such cases, I developed another function that is based purely on regular expressions.

Here it is:

HLS.Extract = function(pattern,text=active.text){


  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern," "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$","\\1y",p) # undo the Porter stemmer's y -> i change

  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)     
  } 

  # shave a few characters off the end (longer words lose more) and append \w*
  # so that the resulting stem matches any suffix, e.g. "hous" -> "hous\w*"
  trim = function(x){

    temp_L  = nchar(x)
    if(temp_L < 5)                {N = 0}
    if(temp_L > 4 && temp_L < 8)  {N = 1}
    if(temp_L > 7 && temp_L < 10) {N = 2}
    if(temp_L > 9)                {N = 3}
    x = substr(x,0,nchar(x)-N)
    x = gsub("(.*)","\\1\\\\\\w\\*",x)

    return(x)
  }

  # SINGLE WORD SCENARIO

  if(length(p)<2){

    # EXACT
    p = trim(p)
    ndx_exact  = grep(p,text,ignore.case=T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern," "))

    express  = new.exp = list()
    express  = c(p,Syns(p))
    p        = unique(wordStem(express))

    temp_exp = unlist(strsplit(express," "))
    temp.p = double(length(seq(temp_exp)))

    for(j in seq(temp_exp)){
      temp_exp[j] = trim(temp_exp[j])
    }

    rgxp   = paste(temp_exp,collapse="|")
    ndx_s  = grep(paste(temp_exp,collapse="|"),text,ignore.case=T,perl=T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s)
  }

  # MORE THAN 2 WORDS

  if(length(p)>1){

    require(combinat)

    # EXACT
    for(j in seq(p)){p[j] = trim(p[j])}

    fp     = factorial(length(p))
    pmns   = permn(length(p))
    tmat   = matrix(0,fp,length(p))
    permut = double(fp)
    temp   = double(length(p))
    for(i in 1:fp){
      tmat[i,] = pmns[[i]]
    }

    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=" ")
    }

    permut = gsub("[[:space:]]",
                  "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*",permut)

    ndx_exact  = grep(paste(permut,collapse="|"),text)
    text_exact = as.character(text[ndx_exact])


    # SEMANTIC

    p = unlist(strsplit(pattern," "))
    express = list()
    charexp = permut = double(length(p))
    for(i in seq(p)){
      express[[i]] = c(p[i],Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$","\\1y",express[[i]])
      for(j in seq(express[[i]])){
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]],collapse="|")
    }

    charexp  = gsub("(.*)","\\(\\1\\)",charexp)
    charexpX = double(length(p))
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=
                          "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp   = paste(permut,collapse="|")
    ndx_s  = grep(rgxp,text,ignore.case=T)
    text_s = as.character(text[ndx_s])

    # replace empty results with 0 so the output list below is well-formed
    temp.f = function(x){
      if(length(x)==0){x=0}
      x
    }

    ndx_exact  = temp.f(ndx_exact);  ndx_s  = temp.f(ndx_s)
    text_exact = temp.f(text_exact); text_s = temp.f(text_s)

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s,
                    "Synset"        = express)

  }
  cat(paste("Exact Matches: ",length(ndx_exact),"\n",sep=""))
  cat(paste("Semantic Matches: ",length(ndx_s),"\n",sep=""))
  return(f.object)
}

Try it out:

HLS.Extract("buy house",
            c("we bought a new house",
              "I'm thinking about buying a new home",
              "purchasing a brand new house"))[["SemanticText"]]

$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"

As you can see, the function is quite flexible. It would also pick up phrases like "house purchase". It doesn't pick up "we bought a new house", however, because "buy" is an irregular verb; that is the kind of thing LSA would pick up.
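One possible workaround for the irregular-verb gap, assuming you're willing to add the textstem package (not used above): lemmatize the texts before matching, so that "bought" collapses back to "buy".

library(textstem)
lemmatize_strings("we bought a new house")
# should yield "we buy a new house", which the regex built for "buy" would then match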

So you may want to try both and see which one works better. Note that the SemanticLink function also requires a lot of memory, so you won't be able to use it on a particularly large corpus.

Cheers

Answer 1 (score: 0)

I suggest you read the answers to this question; the first two answers in particular are really good. I can also recommend the Natural Language Processing Toolkit (haven't tried it personally).

Answer 2 (score: 0)

For similarity between news articles, you can extract keywords using part-of-speech tagging. NLTK provides a good POS tagger. Using the nouns and noun phrases as keywords, represent each news article as a keyword vector.

Then use cosine similarity or some such text-similarity measure to quantify the similarity.

Further enhancements include handling synonyms, stemming, handling adjectives if necessary, using TF-IDF as the keyword weights in the vector, and so on.
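As a rough sketch of those last two steps in R (reusing the tm and lsa packages from the first answer; the example articles are invented):

library(tm)
library(lsa)

docs = c("The president signed a new trade agreement today.",
         "A new trade deal was signed by the president.",
         "The local team won the championship game last night.")

# TF-IDF weighted term-document matrix
tdm = TermDocumentMatrix(Corpus(VectorSource(docs)),
                         control=list(removePunctuation=TRUE,
                                      stopwords=TRUE,
                                      weighting=weightTfIdf))
m = as.matrix(tdm)

cosine(m[,1],m[,2]) # relatively high: the first two articles paraphrase each other
cosine(m[,1],m[,3]) # low: the third article is unrelated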