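I am trying to get a list of all people on Wikipedia, with as many features as possible, for a machine learning problem.
I have set up a local DBpedia server and raised the limits on various parameters, but somehow I still cannot get the results I need.
The desired output is a CSV in the following format: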
<Person1>,<Feature1>,<Feature2>,<Feature3> .......... and so on
<Person2>,<Feature1>,<Feature2>,<Feature3> .......... and so on
<Person3>,<Feature1>,<Feature2>,<Feature3> .......... and so on
... and so on
Can someone point me to the right way to do this?
For example, when I run this query, I get a blank result:
QUERY:
SELECT ?name ?birthDate WHERE {
{
SELECT strafter(str(?person),"http://dbpedia.org/resource/") as ?name, str(?birthDate) as ?birthDate WHERE {
?person a <http://dbpedia.org/ontology/Person> .
?person dbpedia-owl:birthDate ?birthDate .
}
ORDER BY ASC(?name)
}
}
OFFSET 100000
LIMIT 500
RESULT: [[name]] [[birthDate]]
But when I run this query, I only get 50000 rows:
QUERY:
SELECT strafter(str(?person),"http://dbpedia.org/resource/") as ?name,
str(?birthDate) as ?birthDate, str(?birthName) as ?birthName,
strafter(str(?occupation),"http://dbpedia.org/resource/") as ?occupation WHERE {
?person a <http://dbpedia.org/ontology/Person> .
?person dbpedia-owl:birthDate ?birthDate .
?person dbpedia-owl:birthName ?birthName .
?person dbpedia-owl:occupation ?occupation .
}
RESULT: << 50000 rows >>
Strangely, this query seems to work (at least it returns a plausible number of rows):
QUERY:
select ?s ?p ?o { ?s a dbpedia-owl:Person ; ?p ?o }
RESULT: << 1051038 rows >>
My virtuoso.ini file:
[Database]
DatabaseFile = /var/lib/virtuoso/db/virtuoso.db
ErrorLogFile = /var/lib/virtuoso/db/virtuoso.log
LockFile = /var/lib/virtuoso/db/virtuoso.lck
TransactionFile = /var/lib/virtuoso/db/virtuoso.trx
xa_persistent_file = /var/lib/virtuoso/db/virtuoso.pxa
ErrorLogLevel = 7
FileExtend = 200
;MaxCheckpointRemap = 2000
MaxCheckpointRemap = 1362500
Striping = 0
TempStorage = TempDatabase
[TempDatabase]
DatabaseFile = /var/lib/virtuoso/db/virtuoso-temp.db
TransactionFile = /var/lib/virtuoso/db/virtuoso-temp.trx
MaxCheckpointRemap = 2000
Striping = 0
[Parameters]
ServerPort = 1111
LiteMode = 0
DisableUnixSocket = 1
DisableTcpSocket = 0
;SSLServerPort = 2111
;SSLCertificate = cert.pem
;SSLPrivateKey = pk.pem
;X509ClientVerify = 0
;X509ClientVerifyDepth = 0
;X509ClientVerifyCAFile = ca.pem
ServerThreads = 20
CheckpointInterval = 60
O_DIRECT = 0
CaseMode = 2
MaxStaticCursorRows = 500000000
CheckpointAuditTrail = 0
AllowOSCalls = 0
SchedulerInterval = 10
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
ThreadCleanupInterval = 0
ThreadThreshold = 10
ResourcesCleanupInterval = 0
FreeTextBatchSize = 100000
SingleCPU = 0
VADInstallDir = /usr/share/virtuoso/vad/
PrefixResultNames = 0
RdfFreeTextRulesSize = 100
IndexTreeMaps = 256
MaxMemPoolSize = 200000000
PrefixResultNames = 0
MacSpotlight = 0
IndexTreeMaps = 64
MaxSortedTopRows = 100000000
;;
;; Uncomment next two lines if there is 64 GB system memory free
NumberOfBuffers = 5450000
MaxDirtyBuffers = 4000000
;;
[HTTPServer]
ServerPort = 8890
ServerRoot = /var/lib/virtuoso/vsp
ServerThreads = 20
DavRoot = DAV
EnabledDavVSP = 0
HTTPProxyEnabled = 0
TempASPXDir = 0
DefaultMailServer = localhost:25
ServerThreads = 10
MaxKeepAlives = 10
KeepAliveTimeout = 10
MaxCachedProxyConnections = 10
ProxyConnectionCacheTimeout = 15
HTTPThreadSize = 280000
HttpPrintWarningsInOutput = 0
Charset = UTF-8
;HTTPLogFile = logs/http.log
[AutoRepair]
BadParentLinks = 0
[Client]
SQL_PREFETCH_ROWS = 100
SQL_PREFETCH_BYTES = 16000
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0
;SQL_NO_CHAR_C_ESCAPE = 1
;SQL_UTF8_EXECS = 0
;SQL_NO_SYSTEM_TABLES = 0
;SQL_BINARY_TIMESTAMP = 1
;SQL_ENCRYPTION_ON_PASSWORD = -1
[VDB]
ArrayOptimization = 0
NumArrayParameters = 10
VDBDisconnectTimeout = 1000
KeepConnectionOnFixedThread = 0
[Replication]
ServerName = db-IP-172-31-24-242
ServerEnable = 1
QueueMax = 5000000
[Striping]
Segment1 = 100M, db-seg1-1.db, db-seg1-2.db
Segment2 = 100M, db-seg2-1.db
;...
[Zero Config]
ServerName = virtuoso (IP-172-31-24-242)
[URIQA]
DynamicLocal = 0
DefaultHost = localhost:8890
[SPARQL]
;ExternalQuerySource = 1
;ExternalXsltSource = 1
;DefaultGraph = http://localhost:8890/dataspace
;ImmutableGraphs = http://localhost:8890/dataspace
;ResultSetMaxRows = 10000
ResultSetMaxRows = 1000000000
;MaxQueryCostEstimationTime = 400 ; in seconds
MaxQueryCostEstimationTime = 4000000000000000 ; in seconds
;MaxQueryExecutionTime = 60 ; in seconds
MaxQueryExecutionTime = 600000000000000 ; in seconds
DefaultQuery = select distinct ?Concept where {[] a ?Concept} LIMIT 100
DeferInferenceRulesInit = 0 ; controls inference rules loading
;PingService = http://rpc.pingthesemanticweb.com/
MaxSortedTopRows = 10000000
[Plugins]
LoadPath = /usr/lib/virtuoso/hosting
Load1 = plain, wikiv
Load2 = plain, mediawiki
Load3 = plain, creolewiki
Load4 = plain, im
Please let me know if I have missed something trivial, but the results of these queries make no sense to me.
Answer 0 (score: 0)
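Since you are running several quite different queries, it is hard to pin down your exact problem. If you want to isolate the cause, the best approach is to make small changes, one at a time.
Also: all your queries are syntactically illegal SPARQL, which makes it hard to tell what is going wrong. In particular, the way you write your 'AS' aliases is incorrect: first, they should be enclosed in parentheses, and second, you should not alias to a variable that already exists. For example, instead of something like: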
str(?birthDate) as ?birthDate
you should write something like:
(str(?birthDate) as ?bd)
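Apart from that, in your first query you set the offset to 100000. Presumably you get no answers simply because there are fewer than 100000 results.
In your second query you get 50000 results, which may well accurately reflect the actual number of persons that match your criteria. Again, the query is a bit odd in that it uses the 'AS' alias to try to 'rebind' variables to new values.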
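To make the two points above concrete, here is a minimal sketch in Python that only builds query text and never contacts a server. The function names, the alias ?bd and the paging values are made up for illustration, and I am assuming that dbpedia-owl: stands for http://dbpedia.org/ontology/ (which matches the full IRI used in your first query). The count query is one way to check whether an offset is larger than the number of matching rows.

```python
# Sketch only: builds SPARQL strings, does not send them anywhere.
# Assumption: dbpedia-owl: maps to the DBpedia ontology namespace.
PREFIXES = "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"


def build_person_query(limit, offset):
    """Return one page of a person/birth-date query.

    Each alias is wrapped in parentheses and binds a NEW variable
    (?name, ?bd) instead of reusing ?birthDate.
    """
    return PREFIXES + f"""
SELECT (STRAFTER(STR(?person), "http://dbpedia.org/resource/") AS ?name)
       (STR(?birthDate) AS ?bd)
WHERE {{
  ?person a dbpedia-owl:Person ;
          dbpedia-owl:birthDate ?birthDate .
}}
ORDER BY ?person
LIMIT {limit}
OFFSET {offset}
"""


def build_count_query():
    """Return a query that counts the rows the pattern above matches."""
    return PREFIXES + """
SELECT (COUNT(*) AS ?n)
WHERE {
  ?person a dbpedia-owl:Person ;
          dbpedia-owl:birthDate ?birthDate .
}
"""


if __name__ == "__main__":
    print(build_count_query())
    print(build_person_query(limit=500, offset=0))
```

Running the count query first and then paging from offset 0 upwards should show quickly whether an empty page is just an offset past the end of the result.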
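Finally, your last query simply retrieves all triples about resources of type Person. Since you have not restricted it any further, it is no surprise that this result is much larger. Each row in the result is one property-value combination for a particular person.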
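Because each row is a single property-value pair, getting the one-row-per-person CSV from the question means pivoting those rows on the client side. The following is only a rough sketch of that idea in Python: the function name, the "person" header and the "|" separator for multi-valued properties are my own choices, and the sample rows are placeholders rather than real DBpedia data.

```python
import csv
import io
from collections import defaultdict


def pivot_triples(triples, multi_sep="|"):
    """Turn (subject, predicate, object) rows into one CSV row per subject.

    Columns appear in the order their predicate is first seen; several
    values for the same predicate are joined with multi_sep.
    """
    columns = {}  # dict used as an ordered set of predicates
    features = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        columns.setdefault(p, None)
        features[s][p].append(o)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["person"] + list(columns))
    for s, props in features.items():
        # Missing properties become empty cells.
        writer.writerow([s] + [multi_sep.join(props.get(c, [])) for c in columns])
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder rows in the shape returned by "select ?s ?p ?o".
    sample = [
        ("Person1", "Feature1", "a"),
        ("Person1", "Feature2", "b"),
        ("Person1", "Feature2", "c"),
        ("Person2", "Feature1", "d"),
    ]
    print(pivot_triples(sample), end="")
```

For a result with about a million rows this keeps everything in memory, which may or may not be acceptable on your machine; streaming rows that are already sorted by subject would be the obvious alternative.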
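I recommend working through a basic SPARQL tutorial, as I think you may be missing some fundamentals. SPARQL takes some getting used to, but once you have learned the basics (such as what graph pattern matching actually means), you should find it much easier to write your own queries.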