Matching strings that have at least one word in common

Time: 2013-11-25 21:29:56

Tags: java rdf sparql jena

I am writing a query to get the URIs of documents that have a specific title. My query is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?document WHERE {
  ?document dc:title ?title.
  FILTER (?title = "…" ).
}

where "…" is actually the value of this.getTitle(), since the query string is generated by:

String queryString = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> SELECT ?document WHERE { " +
                "?document dc:title ?title." +
                "FILTER (?title = \"" + this.getTitle() + "\" ). }";

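With the query above, I only get documents whose title is exactly the same as this.getTitle(). Suppose this.getTitle() consists of more than one word. I would like to get a document even if only one of the words in this.getTitle() appears in the document's title. How can I do that?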

1 answer:

Answer 0 (score: 3)

Suppose you have some data like this (in Turtle):

@prefix : <http://stackoverflow.com/q/20203733/1281433> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

:a dc:title "Great Gatsby" .
:b dc:title "Boring Gatsby" .
:c dc:title "Great Expectations" .
:d dc:title "The Great Muppet Caper" .

Then you can use a query like this:

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?title where {
  # this is just in place of this.getTitle().  It provides a value for
  # ?TITLE that is "Gatsby Strikes Again".
  values ?TITLE { "Gatsby Strikes Again" }

  # Select a thing and its title.
  ?x dc:title ?title .

  # Then filter based on whether ?title matches the pattern obtained
  # by replacing the spaces in ?TITLE with "|", matching
  # case insensitively.
  filter( regex( ?title, replace( ?TITLE, " ", "|" ), "i" ))
}

This produces results like:

------------------------
| x  | title           |
========================
| :b | "Boring Gatsby" |
| :a | "Great Gatsby"  |
------------------------

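What is particularly neat about this is that, because the pattern is generated on the fly, you can even build it from another value bound in the graph pattern. For example, if you want all pairs of things whose titles share at least one word, you can do: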

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?xtitle ?y ?ytitle where {
  ?x dc:title ?xtitle .
  ?y dc:title ?ytitle .
  filter( regex( ?xtitle, replace( ?ytitle, " ", "|" ), "i" ) && ?x != ?y )
}
order by ?x ?y

This gives:

-----------------------------------------------------------------
| x  | xtitle                   | y  | ytitle                   |
=================================================================
| :a | "Great Gatsby"           | :b | "Boring Gatsby"          |
| :a | "Great Gatsby"           | :c | "Great Expectations"     |
| :a | "Great Gatsby"           | :d | "The Great Muppet Caper" |
| :b | "Boring Gatsby"          | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :d | "The Great Muppet Caper" |
| :d | "The Great Muppet Caper" | :a | "Great Gatsby"           |
| :d | "The Great Muppet Caper" | :c | "Great Expectations"     |
-----------------------------------------------------------------

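Of course, it is very important to note that you are now generating patterns from your data. That means anyone who can put data into your system can supply very expensive patterns that bog down the query and cause a denial of service. On a more mundane note, you may run into trouble if any of your titles contain characters that have a special meaning in regular expressions. One interesting problem: if a title contains consecutive spaces, the pattern becomes something like The|Words|With||Two|Spaces, and the empty alternative in there will probably make everything match. This is an interesting approach, but it comes with a lot of caveats.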
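The consecutive-spaces caveat can be reproduced outside SPARQL. The following is a minimal sketch that uses java.util.regex as a stand-in for the SPARQL regex engine (an assumption; the engine your store uses may differ in details). The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the SPARQL regex engine.
class EmptyAlternativeDemo {
    // Mirrors replace(?TITLE, " ", "|") from the query above.
    static String naivePattern(String title) {
        return title.replace(" ", "|");
    }

    // Mirrors regex(?title, pattern, "i"): an unanchored, case-insensitive search.
    static boolean matches(String candidate, String pattern) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE)
                      .matcher(candidate)
                      .find();
    }

    public static void main(String[] args) {
        String single = naivePattern("Great Gatsby");   // one space
        String doubled = naivePattern("Great  Gatsby"); // two spaces
        // An unrelated title should not match either pattern, but the
        // empty alternative in the second pattern matches any string.
        System.out.println(single + " -> " + matches("Moby Dick", single));
        System.out.println(doubled + " -> " + matches("Moby Dick", doubled));
    }
}
```

With a single space the unrelated title is rejected; with two spaces the pattern contains an empty alternative, so the search succeeds for every title.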

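In general, you can do this as shown here, or by generating the regular expression in code (where you can take care of escaping and so on), or you can use a SPARQL engine that supports text-search extensions (for example jena-text, which adds Apache Lucene or Apache Solr to Apache Jena).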
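As a rough sketch of the "generate the regular expression in code" option, the helper below splits a title on runs of whitespace, escapes regex metacharacters in each word, and joins the words with "|". TitleRegexBuilder and its methods are hypothetical names, not part of Jena or any SPARQL library, and the set of escaped characters is an assumption that should be checked against the regex dialect your engine uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena or any SPARQL library.
class TitleRegexBuilder {
    // Characters treated as regex metacharacters (an assumed set).
    private static final String META = "\\.^$*+?()[]{}|-";

    // Prefixes every metacharacter in the word with a backslash.
    static String escape(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Splits on runs of whitespace, drops empty pieces, escapes each
    // word, and joins the words with "|". Returns "" if there are no words.
    static String build(String title) {
        List<String> words = new ArrayList<>();
        for (String w : title.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(escape(w));
            }
        }
        return String.join("|", words);
    }

    public static void main(String[] args) {
        String pattern = build("Gatsby  Strikes (Again)");
        System.out.println(pattern);
        Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
        System.out.println(p.matcher("Great Gatsby").find());
        System.out.println(p.matcher("Moby Dick").find());
    }
}
```

Two things this sketch does not handle: an empty result (a blank title yields an empty pattern, which would match everything, so the caller should skip the filter in that case), and escaping the pattern as a SPARQL string literal when it is placed into the query text.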