我想结合PostgreSQL Levenshtein和trigram相似度函数。 trigram相似度函数的主要优点是它可以利用GIN或GIST索引,因此可以快速返回模糊匹配结果。但是,如果在另一个函数内部调用它,则它不使用索引。 为了解决这个问题,这里有一个plpgsql函数“trigram_similarity”,它调用了原始的trigram的“相似性”函数。
CREATE OR REPLACE FUNCTION public.trigram_similarity(
left_string text,
right_string text)
RETURNS real AS
$BODY$
BEGIN
RETURN similarity(left_string, right_string);
END;$BODY$
LANGUAGE plpgsql IMMUTABLE STRICT
COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
OWNER TO postgres;
虽然函数实际上只是调用了trigram的相似性函数,但是当涉及GIN索引利用时,它的行为完全不同。虽然查询的WHERE子句中的原始trigram的相似性函数确实利用了GIN索引,因此查询快速返回结果并且没有太多RAM消耗,但是当使用trigram_similarity时它不会。对于大型数据集模糊匹配分析,RAM完全使用,应用程序冻结... 为了便于说明,这是一个示例查询:
SELECT DISTINCT
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
::numeric(5,4) AS trigram_similarity_weighted_score
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
FROM "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
WHERE
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
AND
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
AND
(((l.l_composite_18 % r.r_composite_18)
)
OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
>= 0.7)
OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667)
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333)
+ trigram_similarity(l."County", r."District")*(0.0416666666666667)
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667)
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667)
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667)
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667)
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667)
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667)
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667)
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667)
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667)
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333)
)
>= 0.7)
) ORDER BY trigram_similarity_composite_score DESC;
此查询导致RAM clottage和应用程序冻结。当“trigram_similarity”替换为“相似性”时,查询执行速度快且没有RAM过度消耗。 为什么“trigram_similarity”和“相似性”表现不同? 有没有办法可以强制GIN或GIST索引利用这个“trigram_similarity”函数或任何其他函数调用trigram的相似函数?
解释分析何时使用“相似性”:
"Unique (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Sort (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
" Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
" Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
" Sort Method: quicksort Memory: 76kB"
" Buffers: shared hit=2513871 read=4158"
" -> Nested Loop (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
" Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
" Buffers: shared hit=2513871 read=4158"
" -> Seq Scan on public.r_rightdatasetexample3214274191 r (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
" Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
" Filter: ((r."ProgramCode")::text = '1'::text)"
" Buffers: shared hit=5800 read=4158"
" -> Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
" Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
" Index Cond: ((l."Name")::text = (r."Name")::text)"
" Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
" Rows Removed by Filter: 50"
" Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"
答案 0 :(得分:1)
在表列上创建索引。您需要修改plpgsql函数以查询GIN或GIST索引的表列,而不是比较两个字符串文字。如果你比较两个字符串文字,那么插件没有要点击的索引,必须在比较它们之前将两个字符串分解为三元组,这是你的问题。
答案 1 :(得分:0)
可以在复合(连接)表达式上创建GIN trgm_ops表达式索引。该指数可以提供%相似度算子,但不能提高相似度函数。