使用iSPARQL使用相似性度量来比较值

时间:2014-07-03 14:53:05

标签: sparql semantic-web similarity dbpedia linked-data

我有一个问题。

我想编写一个查询,检索相似的值(给定相似函数,如Lev)到给定字符串“Londn”,以便与谓词“RDFS:label”进行比较。 DBpedia中。例如,在输出中,我想获得“伦敦”的价值。 我已经读过一个可用的方法可能是使用iSPARQL(“不精确的SPARQL”),尽管它在文献中没有被广泛使用。

我可以使用iSPARQL还是有一些SPARQL方法来执行相同的操作?

1 个答案:

答案 0 :(得分:5)

简短版本 - 您可以在纯SPARQL

中执行此操作

您可以使用这样的查询来查找名称类似于" Londn"的城市,并按相似性(一种衡量标准)排序。答案的其余部分解释了 的工作原理:

select ?city ?percent where {
  ?city a dbpedia-owl:City ;
        rdfs:label ?label .
  filter langMatches( lang(?label), 'en' )

  bind( replace( concat( 'x', str(?label) ), "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$", '$1$2$3$4$5' ) as ?match )
  bind( xsd:float(strlen(?match))/strlen(str(?label)) as ?percent )
}
order by desc(?percent)
limit 100

SPARQL results

city                                  percent
----------------------------------------------
http://dbpedia.org/resource/London    0.833333
http://dbpedia.org/resource/Bonn      0.75
http://dbpedia.org/resource/Loudi     0.6
http://dbpedia.org/resource/Ladnu     0.6
http://dbpedia.org/resource/Lonar     0.6
http://dbpedia.org/resource/Longnan   0.571429
http://dbpedia.org/resource/Longyan   0.571429
http://dbpedia.org/resource/Luoding   0.571429
http://dbpedia.org/resource/Lodhran   0.571429
http://dbpedia.org/resource/Lom%C3%A9 0.5
http://dbpedia.org/resource/Andong    0.5

计算字符串相似性度量

  

注意:本部分答案中的代码适用于Apache Jena。实际上是一个边缘情况导致这在Virtuoso中(正确地)失败。最后的更新解决了这个问题。

SPARQL中没有内置计算字符串匹配距离,但您可以使用SPARQL中的正则表达式替换机制来完成其中的一些操作。假设你想匹配序列" cat"在某些字符串中。然后你可以使用这样的查询来计算序列中存在多少给定字符串" cat":

select ?string ?match where {
  values ?string { "cart" "concatenate" "hat" "pot" "hop" }
  bind( replace( ?string, "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$", "$1$2$3" ) as ?match )
}
-------------------------
| string        | match |
=========================
| "cart"        | "cat" |
| "concatenate" | "cat" |
| "hat"         | "at"  |
| "pot"         | "t"   |
| "hop"         | ""    |
-------------------------

通过检查字符串和匹配的长度,您应该能够计算一些不同的相似性度量。作为使用" Londn"的一个更复杂的例子。输入你提到的。百分比列是与输入匹配的字符串的百分比。

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( ?string, ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)
--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------

更新

上面的代码在Apache Jena中有效,但在Virtuoso中失败,因为模式可以匹配空字符串。例如,如果您在DBpedia的端点(由Virtuoso提供支持)上尝试以下查询,您将收到以下错误:

select (replace( "foo", ".*", "x" ) as ?bar) where {}
  

Virtuoso 22023错误基于正则表达式的XPATH / XQuery / SPARQL替换()   函数无法搜索即使在一个中也可以找到的模式   空字符串

这令我感到惊讶,但replace的规范说它基于XPath fn:replace。 fn:replace的文档说:

  

如果模式匹配零长度,则会引发错误[err:FORX0003]   字符串,即表达式fn:匹配("",$ pattern,$ flags)   返回true。但是,如果捕获的子字符串是,则不是错误   零长度。

但是,我们可以通过在模式和字符串的开头添加一个字符来解决这个问题:

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( concat('x',?string), ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)
--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------