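I'm using the regex SPARQL function and passing two variables to it like this:

    FILTER regex(?x, ?y, "i")

I want to compare these two strings: Via de' cerretani and via dei Cerretani. The idea is to extract the significant word of the first string, which in this case is usually the last word (cerretani), and check whether it is contained in the second string. As you can see, I'm passing the two strings as variables. How can I do this?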
Answer 0 (score: 2)
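At first I thought this was a duplicate of your earlier question, Comparing two strings with SPARQL, but that one asked for a function that returns an edit distance. The task here is more specific: check whether the last word of one string is contained in another string, case-insensitively. We can do this as long as we read your specification, "the significant word of the string ... which is usually the last one", strictly and always use just the last word of the string (in general there is no way to determine what the "significant word" of a string is). You won't end up using the regex function, though. Instead, we'll use replace, contains, and lcase (or ucase).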
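The trick is that we can get the last word of a string ?x by using replace to remove everything except the last word (that is, all the preceding words and the space before the last word), and then use contains to check whether that last word is contained in the other string. By applying a case-normalization function to both arguments (I use lcase in the code below, but ucase should work too), we can make the containment check case-insensitive.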
    select ?x ?y ?lastWordOfX ?isMatch ?isIMatch where {
      # Values gives us some test data. It just means that ?x and ?y
      # will be bound to the specified values. In your final query,
      # these would be coming from somewhere else.
      values (?x ?y) {
        ("Via de' cerretani" "via dei Cerretani")
        ("Doctor Who" "Who's on first?")
        ("CaT" "The cAt in the hat")
        ("John Doe" "Don't, John!")
      }

      # For "the significant word of the string which is
      # usually the last one", note that "all but the last word"
      # is matched by the pattern ".* ". We can replace "all but the
      # last word" with the empty string to leave just the last word.
      # (Note that if the pattern doesn't match, then the original
      # string is returned. This is good for us, because if there's
      # just a single word, then it's also the last word.)
      bind( replace( ?x, ".* ", "" ) as ?lastWordOfX )

      # When you check whether the second string contains the last
      # word of the first, you can either leave the cases as they are
      # and have a case-sensitive check, or you can convert them both
      # to the same case and have a case-insensitive match.
      bind( contains( ?y, ?lastWordOfX ) as ?isMatch )
      bind( contains( lcase(?y), lcase(?lastWordOfX) ) as ?isIMatch )
    }
    ---------------------------------------------------------------------------------
    | x                   | y                    | lastWordOfX | isMatch | isIMatch |
    =================================================================================
    | "Via de' cerretani" | "via dei Cerretani"  | "cerretani" | false   | true     |
    | "Doctor Who"        | "Who's on first?"    | "Who"       | true    | true     |
    | "CaT"               | "The cAt in the hat" | "CaT"       | false   | true     |
    | "John Doe"          | "Don't, John!"       | "Doe"       | false   | false    |
    ---------------------------------------------------------------------------------
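This may look like a lot of code, but that's because there are comments, the last word is bound to a separate variable, and I've included both the case-sensitive and the case-insensitive match. When you actually use it, it will be much shorter. For example, to select only the ?x and ?y that match in this way: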
    select ?x ?y {
      values (?x ?y) {
        ("Via de' cerretani" "via dei Cerretani")
        ("Doctor Who" "Who's on first?")
        ("CaT" "The cAt in the hat")
        ("John Doe" "Don't, John!")
      }
      filter( contains( lcase(?y), lcase(replace( ?x, ".* ", "" ))))
    }
    ----------------------------------------------
    | x                   | y                    |
    ==============================================
    | "Via de' cerretani" | "via dei Cerretani"  |
    | "Doctor Who"        | "Who's on first?"    |
    | "CaT"               | "The cAt in the hat" |
    ----------------------------------------------
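
One edge case the pattern ".* " does not cover: if ?x ends with a space, the greedy match consumes the whole string, the "last word" becomes the empty string, and contains is then true for every ?y. Here is a minimal sketch of one way to guard against that, assuming trailing whitespace can actually occur in your data. The variable name ?trimmed and the two test rows are illustrative only and are not part of the original question.

```sparql
select ?x ?y ?lastWordOfX where {
  values (?x ?y) {
    ("Via de' cerretani " "via dei Cerretani")   # note the trailing space
    ("John Doe "          "Don't, John!")
  }
  # Step 1: strip trailing whitespace so the last word really is last.
  bind( replace( ?x, "\\s+$", "" ) as ?trimmed )
  # Step 2: drop everything up to and including the last whitespace character.
  bind( replace( ?trimmed, "^.*\\s", "" ) as ?lastWordOfX )
  # Same case-insensitive containment check as before.
  filter( contains( lcase(?y), lcase(?lastWordOfX) ) )
}
```

With these two rows, only the first should pass the filter, with ?lastWordOfX bound to "cerretani". Using \\s instead of a literal space also treats tabs and other whitespace as word separators, which may or may not be what you want.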
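It's true that

    contains( lcase(?y), lcase(replace( ?x, ".* ", "" )))

is a bit longer than

    regex(?x, ?y, "some-special-flag")

but I think it's still pretty short. If you're willing to use the last word of ?x as a regular expression (probably not a good idea, since you can't be sure it contains no special regex characters), you could even use: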
    regex( ?y, replace( ?x, ".* ", "" ), "i" )
but I suspect that contains would be faster, since regex has more to check.
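
If you do want the regex route, the special-character concern can be reduced by escaping the word before using it as a pattern. The following is a sketch rather than a tested recipe: it assumes the metacharacters listed in the character class are the ones that matter for your engine's regex dialect. The variable ?escaped and the "Route A.1" test row are introduced only for this example.

```sparql
select ?x ?y where {
  values (?x ?y) {
    ("Via de' cerretani" "via dei Cerretani")
    ("Route A.1"         "route a.1 north")
  }
  bind( replace( ?x, ".* ", "" ) as ?lastWordOfX )
  # Prefix each regex metacharacter with a backslash. In a SPARQL string
  # literal, "\\" denotes a single backslash, hence the doubled escapes.
  bind( replace( ?lastWordOfX, "([\\\\.\\[\\]{}()*+?^$|])", "\\\\$1" ) as ?escaped )
  # regex takes the text first and the pattern second.
  filter( regex( ?y, ?escaped, "i" ) )
}
```

Without the escaping step, the "." in "A.1" would match any character, so the pattern could also match strings such as "route ax1". Even with escaping, contains plus lcase remains the simpler choice when all you need is a literal substring test.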