Question

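I would like to set up something like this:

An RDF dataset of metadata about books;
The books stored separately, e.g. as XHTML files, with paragraphs that have unique IDs;
Each book's metadata contains a dc:source link to the file with the content (absolute? as a proper URI? what about scaling?);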

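I know this is probably trivial, but I can't get it right. To begin with, I tried indexing only tiny plain TXT files, each linked via dc:source from the metadata file. As I understand it, that should be enough to index all of the content. I tried to do it the way the person in this post did. Unlike him, I want to index the RDF dataset as well as the external files. In particular, these two commands log no errors (on the contrary, the first one reports that 57 triples were loaded):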

java -cp /home/honza/.apache-jena-fuseki-2.3.0/fuseki-server.jar tdb.tdbloader --tdb=run/configuration/service2.ttl testDir/test_dataset.ttl

INFO  -- Start triples data phase
INFO  ** Load into triples table with existing data
INFO  -- Start quads data phase
INFO  ** Load empty quads table
INFO  Load: testDir/test_dataset.ttl -- 2015/11/13 12:46:22 CET
INFO  -- Finish triples data phase
INFO  ** Data: 57 triples loaded in 0,29 seconds [Rate: 193,22 per second]
INFO  -- Finish quads data phase
INFO  -- Start triples index phase
INFO  -- Finish triples index phase
INFO  -- Finish triples load
INFO  ** Completed: 57 triples loaded in 0,33 seconds [Rate: 172,21 per second]
INFO  -- Finish quads load

and

java -cp /home/honza/.apache-jena-fuseki-2.3.0/fuseki-server.jar jena.textindexer --desc=run/configuration/service2.ttl

WARN  Values stored but langField not set. Returned values will not have language tag or datatype.

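After that, the server runs fine and I can see the graph, but it contains no data.

My configuration for this service is below (I don't know whether it is correct to put the service and database configuration in a single file; for me it currently works better that way, apart from throwing some errors):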

@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix :        <#> .

[] rdf:type fuseki:Server 
.

<#service2> rdf:type fuseki:Service ;
  rdfs:label                        "TDB/text service" ;
  fuseki:name                       "test" ;       # http://host:port/ds
  fuseki:serviceQuery               "sparql" ;   # SPARQL query service
  fuseki:serviceQuery               "query" ;    # SPARQL query service (alt name)
  fuseki:serviceUpdate              "update" ;   # SPARQL update service
  fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
  fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
  # A separate read-only graph store endpoint:
  fuseki:serviceReadGraphStore      "get" ;      # SPARQL Graph store protocol (read only)
  fuseki:dataset                    :text_dataset 
.

[] ja:loadClass   "org.apache.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

[] ja:loadClass "org.apache.jena.query.text.TextQuery" .

text:TextIndexLucene rdfs:subClassOf  text:TextIndex .
:text_dataset rdf:type text:TextDataset ;
  text:dataset <#test> ;
  text:index <#indexLucene> .

Answer 1

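First, you haven't actually defined the Lucene index explicitly, so what you are probably getting is a transient in-memory index that is thrown away every time the application stops. At a minimum you need something like the following in your configuration: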

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> .

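Here <file:path/to/index/> points to the directory in which you want the text index to be stored.

Second, you haven't told the text search how the Lucene index is structured. Even if you have created the index separately from external files, you still need to define in the configuration how Jena should use and access that index.

From the documentation, you need to define an entity map: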

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .

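The comments in the documentation's example hopefully describe things well. The text:entityField property specifies the field in the index that stores the URI associated with the indexed data, i.e. it provides the means of linking a text index hit back to the RDF in your triple store. The text:defaultField property specifies the field that contains the indexed data, i.e. the field that a text search will actually search by default.

The optional text:map shown here can be used to further customize which fields are searched. It lets you index multiple pieces of content in different fields and then write queries that search the text index in different ways.

Once you have a suitably defined entity map, you need to link it to the index configuration, like so: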
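As a rough illustration of indexing more than one piece of content, the entity map could carry a second entry in text:map. This is only a sketch: the field name "title" and the choice of dc:title as the second predicate are assumptions made for the example, not something taken from the question's data.

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# Variant of the entity map with two indexed fields.
# "title" / dc:title are hypothetical; substitute the predicates
# whose literal values you actually want to search.
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;     # field holding the subject URI
    text:defaultField     "text" ;    # field searched when none is named
    text:map (
         [ text:field "text"  ; text:predicate rdfs:label ]
         [ text:field "title" ; text:predicate dc:title ]
         ) .
```

With a map like this, a query would be expected to be able to target one property rather than the default field; the jena-text documentation describes the exact query form.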

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> ;
    text:entityMap <#entMap> .

With this in place, you should actually be able to get results from the index.
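Putting the pieces together, the dataset part of the configuration might look roughly like the sketch below. It reuses the prefixes declared in the question's file. The question's excerpt refers to <#test> without showing its definition, so the tdb:DatasetTDB block and the location "path/to/tdb" are assumptions, as is the index directory path.

```turtle
# Underlying TDB store wrapped by the text dataset.
# Assumed definition: "path/to/tdb" is a placeholder directory.
<#test> a tdb:DatasetTDB ;
    tdb:location "path/to/tdb" .

# Text dataset = TDB store + Lucene index.
:text_dataset a text:TextDataset ;
    text:dataset <#test> ;
    text:index   <#indexLucene> .

# Persistent Lucene index, linked to the entity map.
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> ;
    text:entityMap <#entMap> .

# Which field stores the URI and which predicates get indexed.
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
```

Data that was loaded before the index was configured would presumably need to be indexed again, for instance by re-running the jena.textindexer command from the question against the updated file.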

Apache Jena full-text search (with external content)
