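I have a table of movies, and I want to search the titles and return the closest matches.
I thought full-text search might help, but it doesn't seem to be able to order by the position of the words, even though Postgres knows the positions. Is this possible in Postgres?
Here is my query: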
SELECT collectibles.id, collectibles.title, ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;
Here are some of the results (this is the best formatting I could figure out, sorry!):
id | title | score
- 277568 | Wilson Meadows: Live At The 15th Old School & Blues Festival | 0.1
- 3545 | 5 Film Collection: Will Ferrell: Campaign / Old School (Unrtated Version) / Blades Of Glory / Roxbury / Semi-Pro | 0.1
- 10366 | Alice Cooper: Old School: 1964-1974 (DVD/CD Combo) | 0.1
- 13004 | American Classics: Old School (3-Disc Set) | 0.1
- 13005 | American Classics: Old School: Classic Chevrolets | 0.1
- 13006 | American Classics: Old School: Classic Travel Trailers | 0.1
- 13007 | American Classics: Old School: Kings Of Kustomizing | 0.1
- 14592 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Back-To-Back) | 0.1
- 14593 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Side-By-Side) | 0.1
- 20242 | Audiovisualize: Mixed By Addictive TV: Snake Worship Island / Corp. Inc. / Old School Futures / These Melodies / Robot War / ... | 0.1
- 192057 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) | 0.1
- 192058 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Back-To-Back) | 0.1
- 192059 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Side-By-Side) | 0.1
- 192060 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Back-To-Back) | 0.1
- 192061 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Side-By-Side) | 0.1
- 192062 | Old School (Warner Brothers/ R-Rated Version) | 0.1
- 192063 | Old School (Warner Brothers/ R-Rated Version/ Blu-ray) | 0.1
- 192064 | Old School (Warner Brothers/ Unrated Version) | 0.1
- 192065 | Old School (Warner Brothers/ Unrated Version/ Blu-ray) | 0.1
- 192066 | Old School Comedy (4-Pack): Atoll K / Jack And The Beanstalk / The Flying Deuces / Africa Screams | 0.1
- 192067 | Old School Hip Hop Dance #1: Beginner | 0.1
- 192068 | Old School Hip Hop Greatest | 0.1
- 192069 | Old School Hip Hop: Run DMC & Flava Flav (2-Disc) | 0.1
- 192070 | Old School Hits Movie Marathon Collection (3-Disc) | 0.1
- 192071 | Old School Returns | 0.1
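All of these score 0.1, yet in many of the titles the words appear closer to the front of the string. Is there a way to rank those higher? Unfortunately, neither the length of the string nor the id is a good qualifier for ranking.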
Answer 0 (score: 1)
Here you need to use normalization with the ts_rank(tsvector, tsquery, normalization) function. In the snippet below I used normalization = 1 (divides the rank by 1 + the logarithm of the document length), but you can adjust it to what you really need. Here is an example:
WITH s(id,tsv) AS ( VALUES
(1,to_tsvector('english','Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
(2,to_tsvector('english','American Classics: Old School: Kings Of Kustomizing')),
(3,to_tsvector('english','Old School Hip Hop Greatest')),
(4,to_tsvector('english','Old School Returns'))
)
SELECT id,ts_rank(tsv,tsq,1) AS rank
FROM s,to_tsquery('english','old & school') tsq
ORDER BY rank DESC;
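Applied to the query from the question, this could look roughly like the sketch below. It assumes the collectibles table and title column from the question and passes 'english' explicitly to plainto_tsquery, which the original query did not do. Note that flag 1 normalizes by document length, so it favors shorter titles rather than earlier word positions.

```sql
-- Sketch only: the question's query with normalization flag 1
-- passed as the third argument of ts_rank.
SELECT c.id,
       c.title,
       ts_rank(to_tsvector('english', c.title),
               plainto_tsquery('english', 'old school'),
               1) AS score   -- 1 = divide rank by 1 + log(document length)
FROM collectibles AS c
WHERE to_tsvector('english', c.title) @@ plainto_tsquery('english', 'old school')
ORDER BY score DESC;
```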
Answer 1 (score: 1)
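Very old question, but:
You can use ts_rank_cd() to take the distance between lexemes (keywords) into account. (I don't know exactly how this is done.)
You can also pass 4 as the normalization integer (it is a bitmask) to divide the rank by the mean harmonic distance between extents (with ts_rank_cd).
I haven't looked into this very deeply, but hopefully it's a starting point.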
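A minimal sketch of that idea, using two made-up titles (they are illustrative assumptions, not rows from the question) so that it can be run without the collectibles table:

```sql
-- Sketch only: ts_rank_cd with normalization flag 4, which divides the rank
-- by the mean harmonic distance between extents.
WITH s(id, tsv) AS (VALUES
  (1, to_tsvector('english', 'Old School Returns')),       -- query words adjacent
  (2, to_tsvector('english', 'Old Movies About School'))   -- query words further apart
)
SELECT id, ts_rank_cd(tsv, tsq, 4) AS rank
FROM s, to_tsquery('english', 'old & school') AS tsq
WHERE tsv @@ tsq
ORDER BY rank DESC;
```

Because the normalization argument is a bitmask, flags can be combined with |, for example 4|1 to also apply the length-based normalization.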
Answer 2 (score: 0)
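According to the docs, "* can be attached to a lexeme to specify prefix matching" (written as :* in the query), and "to_tsquery can also accept single-quoted phrases".
Combining the two, you can do this: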
SELECT to_tsquery('''old school'':*');
to_tsquery
----------------------
'old':* & 'school':*
(1 row)
So in your case it would look like this:
SELECT
collectibles.id,
collectibles.title,
ts_rank_cd(
to_tsvector('english', collectibles.title),
to_tsquery('''old school'':*')
) AS score
FROM collectibles
WHERE to_tsvector('english', collectibles.title) @@ to_tsquery('''old school'':*')
ORDER BY score DESC;
Answer 3 (score: 0)
I managed to do this by splitting the title, taking the first word, and giving it a higher priority (weight A):
setweight(to_tsvector('english', split_part(coalesce("title", ''), ' ', 1) ), 'A') ||
setweight(to_tsvector('english', coalesce("title", '')), 'B')
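One way the weighted vector might be plugged into a full query is sketched below. The table, column, and search phrase are taken from the question; how the answer's author actually used the expression is not shown, so treat the surrounding query as an assumption. With the default ts_rank weights, A-weighted lexemes count more than B-weighted ones, so titles whose first word matches the query should score higher.

```sql
-- Sketch only: rank against a vector in which the first word of the title
-- carries weight A and the full title carries weight B.
SELECT c.id,
       c.title,
       ts_rank(
         setweight(to_tsvector('english', split_part(coalesce(c.title, ''), ' ', 1)), 'A') ||
         setweight(to_tsvector('english', coalesce(c.title, '')), 'B'),
         plainto_tsquery('english', 'old school')
       ) AS score
FROM collectibles AS c
WHERE to_tsvector('english', coalesce(c.title, '')) @@ plainto_tsquery('english', 'old school')
ORDER BY score DESC;
```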