Local DBpedia server returns strange results for different queries

Time: 2014-10-31 20:17:03

Tags: rdf sparql semantic-web dbpedia virtuoso

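I am trying to get a list of all people on Wikipedia, with as many features as possible, to use in a machine learning problem.

I have set up a local DBpedia server and raised the limits of various parameters, but somehow I still cannot get the results I need.

The desired output is a CSV in the following format: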

<Person1>,<Feature1>,<Feature2>,<Feature3> .......... and so on
<Person2>,<Feature1>,<Feature2>,<Feature3> .......... and so on
<Person3>,<Feature1>,<Feature2>,<Feature3> .......... and so on
 ...and
 ...so
 ...on

Can someone point me to the right way to do this?

For example, when I run this query, I get a blank result:

QUERY:

 SELECT  ?name ?birthDate WHERE {
   {
      SELECT strafter(str(?person),"http://dbpedia.org/resource/") as ?name,
             str(?birthDate) as ?birthDate WHERE {
      ?person a <http://dbpedia.org/ontology/Person> .
      ?person dbpedia-owl:birthDate ?birthDate .

 }
      ORDER BY ASC(?name) 
  }
} 

 OFFSET 100000
 LIMIT 500

Result: [[name]] [[birthDate]]

But when I run this query, I get only 50000 rows:

QUERY:

  SELECT strafter(str(?person),"http://dbpedia.org/resource/") as ?name,
         str(?birthDate) as ?birthDate,
         str(?birthName) as ?birthName,
         strafter(str(?occupation),"http://dbpedia.org/resource/") as ?occupation
  WHERE {
      ?person a <http://dbpedia.org/ontology/Person> .
      ?person dbpedia-owl:birthDate ?birthDate .
      ?person dbpedia-owl:birthName ?birthName .
      ?person dbpedia-owl:occupation ?occupation .

  }

Result: << 50000 rows >>

Strangely, this query seems to work (at least it returns a decent number of rows):

QUERY:

  select ?s ?p ?o { ?s a dbpedia-owl:Person ; ?p ?o }

Result: << 1051038 rows >>

My virtuoso.ini file:

[Database]
DatabaseFile                    = /var/lib/virtuoso/db/virtuoso.db
ErrorLogFile                    = /var/lib/virtuoso/db/virtuoso.log
LockFile                        = /var/lib/virtuoso/db/virtuoso.lck
TransactionFile                 = /var/lib/virtuoso/db/virtuoso.trx
xa_persistent_file              = /var/lib/virtuoso/db/virtuoso.pxa
ErrorLogLevel                   = 7
FileExtend                      = 200
;MaxCheckpointRemap             = 2000
MaxCheckpointRemap              = 1362500
Striping                        = 0
TempStorage                     = TempDatabase


[TempDatabase]
DatabaseFile                    = /var/lib/virtuoso/db/virtuoso-temp.db
TransactionFile                 = /var/lib/virtuoso/db/virtuoso-temp.trx
MaxCheckpointRemap              = 2000
Striping                        = 0

[Parameters]
ServerPort                      = 1111
LiteMode                        = 0
DisableUnixSocket               = 1
DisableTcpSocket                = 0
;SSLServerPort                  = 2111
;SSLCertificate                 = cert.pem
;SSLPrivateKey                  = pk.pem
;X509ClientVerify               = 0
;X509ClientVerifyDepth          = 0
;X509ClientVerifyCAFile         = ca.pem
ServerThreads                   = 20
CheckpointInterval              = 60
O_DIRECT                        = 0
CaseMode                        = 2
MaxStaticCursorRows             = 500000000
CheckpointAuditTrail            = 0
AllowOSCalls                    = 0
SchedulerInterval               = 10
DirsAllowed                     = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
ThreadCleanupInterval           = 0
ThreadThreshold                 = 10
ResourcesCleanupInterval        = 0
FreeTextBatchSize               = 100000
SingleCPU                       = 0
VADInstallDir                   = /usr/share/virtuoso/vad/
PrefixResultNames               = 0
RdfFreeTextRulesSize            = 100
IndexTreeMaps                   = 256
MaxMemPoolSize                  = 200000000
PrefixResultNames               = 0
MacSpotlight                    = 0
IndexTreeMaps                   = 64
MaxSortedTopRows                = 100000000
;;


;; Uncomment next two lines if there is 64 GB system memory free
NumberOfBuffers          = 5450000
MaxDirtyBuffers          = 4000000
;;

[HTTPServer]
ServerPort                      = 8890
ServerRoot                      = /var/lib/virtuoso/vsp
ServerThreads                   = 20
DavRoot                         = DAV
EnabledDavVSP                   = 0
HTTPProxyEnabled                = 0
TempASPXDir                     = 0
DefaultMailServer               = localhost:25
ServerThreads                   = 10
MaxKeepAlives                   = 10
KeepAliveTimeout                = 10
MaxCachedProxyConnections       = 10
ProxyConnectionCacheTimeout     = 15
HTTPThreadSize                  = 280000
HttpPrintWarningsInOutput       = 0
Charset                         = UTF-8
;HTTPLogFile                    = logs/http.log

[AutoRepair]
BadParentLinks                  = 0


[Client]
SQL_PREFETCH_ROWS               = 100
SQL_PREFETCH_BYTES              = 16000
SQL_QUERY_TIMEOUT               = 0
SQL_TXN_TIMEOUT                 = 0  
;SQL_NO_CHAR_C_ESCAPE           = 1
;SQL_UTF8_EXECS                 = 0
;SQL_NO_SYSTEM_TABLES           = 0
;SQL_BINARY_TIMESTAMP           = 1
;SQL_ENCRYPTION_ON_PASSWORD     = -1

[VDB]
ArrayOptimization               = 0
NumArrayParameters              = 10
VDBDisconnectTimeout            = 1000
KeepConnectionOnFixedThread     = 0

[Replication]
ServerName                      = db-IP-172-31-24-242
ServerEnable                    = 1
QueueMax                        = 5000000


[Striping]
Segment1                        = 100M, db-seg1-1.db, db-seg1-2.db
Segment2                        = 100M, db-seg2-1.db
;...


[Zero Config]
ServerName                      = virtuoso (IP-172-31-24-242)

[URIQA]
DynamicLocal                    = 0
DefaultHost                     = localhost:8890


[SPARQL]
;ExternalQuerySource            = 1
;ExternalXsltSource             = 1
;DefaultGraph                   = http://localhost:8890/dataspace
;ImmutableGraphs                = http://localhost:8890/dataspace
;ResultSetMaxRows               = 10000
ResultSetMaxRows                = 1000000000
;MaxQueryCostEstimationTime     = 400   ; in seconds
MaxQueryCostEstimationTime      = 4000000000000000      ; in seconds
;MaxQueryExecutionTime          = 60    ; in seconds
MaxQueryExecutionTime           = 600000000000000       ; in seconds
DefaultQuery                    = select distinct ?Concept where {[] a ?Concept} LIMIT 100
DeferInferenceRulesInit         = 0  ; controls inference rules loading
;PingService                    = http://rpc.pingthesemanticweb.com/
MaxSortedTopRows                = 10000000

[Plugins]
LoadPath                        = /usr/lib/virtuoso/hosting
Load1                           = plain, wikiv
Load2                           = plain, mediawiki
Load3                           = plain, creolewiki
Load4                   = plain, im

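Please let me know if I have missed something trivial, but the results of these queries make no sense to me.

1 answer:

Answer 0 (score: 0)

Because you are running several quite different queries, it is hard to pin down your exact problem. If you want to isolate the cause, the best approach is to change one small thing at a time.

Also: all of your queries are syntactically illegal SPARQL, which makes it hard to tell what is going wrong. In particular, the way you write the 'AS' aliases is incorrect: first, they should be enclosed in parentheses, and second, you should not alias to a variable that already exists. For example, instead of something like: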

str(?birthDate) as ?birthDate

you should write something like:

(str(?birthDate) as ?bd)
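
As a rough illustration of that alias form applied to a whole query, here is a minimal Python sketch that keeps the corrected SPARQL in a string and builds a request URL for it. The `dbpedia-owl:` prefix binding, the helper names `person_query` and `request_url`, and the endpoint address `http://localhost:8890/sparql` (guessed from the `[HTTPServer]` port in the question) are assumptions for illustration, not details confirmed by the asker's setup.

```python
from urllib.parse import urlencode

# Assumed binding: dbpedia-owl is taken to mean the DBpedia ontology namespace.
PREFIXES = "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"


def person_query(limit, offset):
    """Build a SELECT query whose aliases are parenthesised and use fresh names."""
    return PREFIXES + f"""
SELECT (STRAFTER(STR(?person), "http://dbpedia.org/resource/") AS ?name)
       (STR(?birthDate) AS ?bd)
WHERE {{
  ?person a dbpedia-owl:Person ;
          dbpedia-owl:birthDate ?birthDate .
}}
ORDER BY ?person
LIMIT {limit}
OFFSET {offset}
"""


def request_url(endpoint, query):
    """Encode the query as the standard 'query' parameter of a SPARQL GET request."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical local endpoint; adjust to the actual server address.
    print(request_url("http://localhost:8890/sparql", person_query(500, 0)))
```

Note that `?bd` and `?name` do not occur in the WHERE clause, so nothing is being rebound, and the explicit ORDER BY keeps LIMIT/OFFSET paging stable between requests.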

Apart from that, in your first query you set the offset to 100000. Presumably you get no answers simply because there are fewer than 100000 results.
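
One way to check that guess is to count the matches first and then page from offset 0 until a page comes back empty. The sketch below shows the idea in Python; `fetch_page` is a hypothetical callable standing in for the real SPARQL request, the 250 fake rows replace the actual endpoint, and the `dbpedia-owl:` prefix binding in the count query is an assumption.

```python
# Count query to run once before paging (assumed prefix binding).
COUNT_QUERY = """PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT (COUNT(*) AS ?n)
WHERE { ?person a dbpedia-owl:Person ; dbpedia-owl:birthDate ?birthDate . }
"""


def fetch_all(fetch_page, page_size):
    """Collect rows page by page until an empty page is returned.

    fetch_page(limit, offset) is a hypothetical callable that runs the
    query with the given LIMIT and OFFSET and returns a list of rows.
    """
    rows, offset = [], 0
    while True:
        page = fetch_page(page_size, offset)
        if not page:
            break
        rows.extend(page)
        offset += len(page)
    return rows


# Stand-in for the endpoint: 250 made-up rows instead of a real SPARQL call.
DATA = [f"Person_{i}" for i in range(250)]


def fake_fetch(limit, offset):
    return DATA[offset:offset + limit]


if __name__ == "__main__":
    print(len(fake_fetch(500, 100000)))     # offset past the last row: empty page
    print(len(fetch_all(fake_fetch, 100)))  # paging from 0 collects every row
```

If the count turns out to be larger than 100000, the empty result has some other cause and the offset explanation can be ruled out.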

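In your second query you get 50000 results, which may accurately reflect the actual number of people that match the criteria. Again, the query is a bit odd in that it uses the 'AS' alias to try to 'rebind' a variable to a new value.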

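Finally, the last query simply retrieves all triples about resources of type Person. Since you impose no further restriction, it is not surprising that this result is much larger. Each row in the result is one property-value combination for a particular person.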
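
Because that result has one row per property value, getting the one-row-per-person CSV from the question means pivoting it on the client side. A minimal Python sketch of such a pivot follows; the helper names `pivot` and `to_csv`, the sample rows, and the choice to join multi-valued properties with `|` are all illustrative assumptions.

```python
import csv
import io
from collections import defaultdict


def pivot(triples, features):
    """Turn (subject, predicate, object) rows into one row per subject."""
    table = defaultdict(dict)
    for s, p, o in triples:
        row = table[s]  # make sure every subject gets a row
        if p in features:
            # Multi-valued properties are joined with '|' (an arbitrary choice).
            row[p] = f"{row[p]}|{o}" if p in row else o
    return [[s] + [table[s].get(f, "") for f in features] for s in sorted(table)]


def to_csv(rows, header):
    """Serialise the pivoted rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample rows in the shape returned by "select ?s ?p ?o".
    triples = [
        ("Person1", "birthDate", "1990-01-01"),
        ("Person1", "occupation", "Writer"),
        ("Person1", "occupation", "Painter"),
        ("Person2", "birthDate", "1985-05-05"),
    ]
    features = ["birthDate", "occupation"]
    print(to_csv(pivot(triples, features), ["person"] + features))
```

A person with no value for a feature gets an empty cell, which is usually easier for a machine learning pipeline to handle than a missing row.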

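I suggest you work through a basic SPARQL tutorial, as I think you may be missing some fundamentals. SPARQL takes some getting used to, but once you have learned the basics (such as what graph pattern matching actually means), you should find it much easier to write your own queries.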