Replicating Postgres pg_trgm text similarity scores in R?

Time: 2014-07-21 00:10:37

Tags: r postgresql text similarity text-mining

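Does anyone know how to replicate in R the Postgres trigram similarity score returned by the pg_trgm function similarity(text, text)? I am using the stringdist package and would rather compute the scores in R, on a matrix of text strings read from a .csv file, than run a bunch of PostgreSQL queries.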

Running similarity(string1, string2) in Postgres gives me a numeric score between 0 and 1.

I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.

stringdist(string1, string2, method = "qgram", q = 3)

Is there a way to replicate the pg_trgm score with the stringdist package, or some other way to do this in R?

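One example would be getting the similarity score between a book description and the description of the science fiction genre. For example, suppose I have the two book descriptions below and want a similarity score for each: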

book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
     The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
     They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."

book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
     What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
     Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
     Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."

genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."

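How can I use an R script to get, for each book description, a similarity score against the science fiction genre description, the way pg_trgm would?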

1 Answer:

Answer 0: (score: 0)

How about something like this?

library(textcat)

?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.

round(textcat_xdist(
  list(
       text1="hello there",
       text2="why hello there",
       text3="totally different"
       ),
     method="cosine"),
3)

#      text1 text2 text3
#text1 0.000 0.078 0.731
#text2 0.078 0.000 0.739
#text3 0.731 0.739 0.000
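
Note that the values above are distances, not similarities. If the aim is to get closer to what pg_trgm reports, here is a minimal base R sketch. It assumes the behaviour described in the pg_trgm documentation: text is lowercased, non-alphanumeric characters only separate words, each word is padded with two leading spaces and one trailing space, and the score is the number of shared distinct trigrams divided by the number of distinct trigrams in either string. The function names pg_trigrams and pg_similarity are hypothetical helpers, not part of any package, and results may differ from Postgres for non-ASCII text or locale-specific character classes.

```r
# Hypothetical helper: the set of pg_trgm-style trigrams for one string.
pg_trigrams <- function(x) {
  # Lowercase, then split into words on runs of non-alphanumeric characters.
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  # Assumed padding rule: two spaces before and one space after each word.
  padded <- paste0("  ", words, " ")
  grams <- unlist(lapply(padded, function(w) {
    n <- nchar(w)
    substring(w, 1:(n - 2), 3:n)  # every 3-character window of the padded word
  }))
  unique(grams)
}

# Hypothetical helper: shared distinct trigrams / all distinct trigrams.
pg_similarity <- function(a, b) {
  ta <- pg_trigrams(a)
  tb <- pg_trigrams(b)
  total <- length(union(ta, tb))
  if (total == 0) return(0)
  length(intersect(ta, tb)) / total
}

pg_trigrams("cat")
# "  c" " ca" "cat" "at "

pg_similarity("hello there", "why hello there")
# 0.75
```

To score several descriptions against one genre description, something like sapply(c(book1, book2), pg_similarity, b = genre1) should work, where book1, book2 and genre1 are assumed to be character strings holding the texts from the question. For comparison, stringdist(a, b, method = "jaccard", q = 3) gives one minus a Jaccard coefficient over 3-grams, but it does not apply the word splitting and space padding assumed here, so its numbers will generally not match pg_trgm exactly.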