How can I prevent the triples from getting mixed up when uploading to Dydra programmatically?

Time: 2015-12-22 19:15:39

Tags: python semantic-web sesame rdflib linked-data

I am trying to upload some data from a Sesame triplestore on my computer to Dydra. The download from Sesame works fine, but the triples get mixed up (the s-p-o relationships change, with the object of one statement ending up attached to another). Can someone explain why this happens and how to fix it? The code is below:

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph, URIRef
from bs4 import BeautifulSoup
import requests
import pprint

#Querying the triplestore to retrieve all results
sesameSparqlEndpoint = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
sparql = SPARQLWrapper(sesameSparqlEndpoint)
queryStringDownload = 'SELECT * WHERE {?s ?p ?o}'
dataGraph = Graph()

sparql.setQuery(queryStringDownload)
sparql.method = 'GET'
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
print output

for i in range(len(output['results']['bindings'])):
    #The encoding is necessary to parse non-English characters
    output['results']['bindings'][i]['s']['value'].encode('utf-8')
    try:
        subject_extract = output['results']['bindings'][i]['s']['value']
        if 'http' in subject_extract:
            subject = "<" + subject_extract + ">"
            subject_url = URIRef(subject)
            print subject_url

        predicate_extract = output['results']['bindings'][i]['p']['value']
        if 'http' in predicate_extract:
            predicate = "<" + predicate_extract + ">"
            predicate_url = URIRef(predicate)
            print predicate_url

        objec_extract = output['results']['bindings'][i]['o']['value']
        if 'http' in objec_extract:
            objec = "<" + objec_extract + ">"
            objec_url = URIRef(objec)
            print objec_url
        else:
            objec = objec_extract
            objec_wip = '"' + objec + '"'
            objec_url = URIRef(objec_wip)

        # Loading the data on a graph       
        dataGraph.add((subject_url,predicate_url,objec_url))

    except UnicodeError as error: 
        print error

#Print all statements in dataGraph      
for stmt in dataGraph:
    pprint.pprint(stmt)

# Upload to Dydra
URL = 'http://dydra.com/login'
key = 'my_key'

with requests.Session() as s:
    resp = s.get(URL)
    soup = BeautifulSoup(resp.text,"html5lib")
    csrfToken = soup.find('meta',{'name':'csrf-token'}).get('content')
    # print csrf_token
    payload = {
    'account[login]':key,
    'account[password]':'',
    'csrfmiddlewaretoken':csrfToken,
    'next':'/'
    }
    # print payload

    p = s.post(URL,data=payload, headers=dict(Referer=URL))
    # print p.text

    r = s.get('http://dydra.com/username/rep_name/sparql')
    # print r.text

    dydraSparqlEndpoint = 'http://dydra.com/username/rep_name/sparql'
    for stmt in dataGraph:
        queryStringUpload = 'INSERT DATA {%s %s %s}' % stmt
        sparql = SPARQLWrapper(dydraSparqlEndpoint)
        sparql.setCredentials(key,key)
        sparql.setQuery(queryStringUpload)
        sparql.method = 'POST'
        sparql.query()

2 Answers:

Answer 0 (Score: 1)

A simple way to copy over the data (apart from using a CONSTRUCT query instead of a SELECT, as I mentioned in the comments) is to just let Dydra itself access your Sesame endpoint directly, for example via a SERVICE clause.

Execute the following on your Dydra database and (after a while, depending on the size of your Sesame database) everything will have been copied over:

   INSERT { ?s ?p ?o }
   WHERE { 
      SERVICE <http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name> 
      { ?s ?p ?o }
   }
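
Since the goal in the question is to do this programmatically, here is a minimal sketch of sending that same update from Python with SPARQLWrapper (the library already used in the question); the endpoint URL and the key-as-username/password credentials are taken from the question's own upload code, and whether your Dydra repository accepts a SERVICE clause inside an update is an assumption you would need to verify:

# Hypothetical sketch: send the federated INSERT to the Dydra SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper

copyQuery = """
INSERT { ?s ?p ?o }
WHERE {
  SERVICE <http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name>
  { ?s ?p ?o }
}
"""

sparql = SPARQLWrapper('http://dydra.com/username/rep_name/sparql')
sparql.setCredentials('my_key', 'my_key')  # key as both login and password, as in the question
sparql.setQuery(copyQuery)
sparql.method = 'POST'                     # updates must be sent via POST
sparql.query()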

If the above does not work on Dydra, you can also access the RDF statements directly from your Sesame store via the URI http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements. Assuming Dydra has an upload feature where you can supply the URL of an RDF document, you can simply give it that URI and it should be able to load the data.
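
If Dydra cannot be pointed at that URI directly, a small sketch of pulling the statements down from Sesame yourself first, assuming the repository URL from the question; text/plain is the content type Sesame traditionally uses for N-Triples, so adjust the Accept header if your server expects something else:

# Sketch: download the repository contents from the Sesame statements endpoint.
import requests

statementsUrl = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements'
resp = requests.get(statementsUrl, headers={'Accept': 'text/plain'})  # request N-Triples
with open('rep_name.nt', 'wb') as f:
    f.write(resp.content)  # an N-Triples file you can then upload to Dydra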

Answer 1 (Score: 0)

The code above works fine once the following changes are made:

  1. Use a CONSTRUCT query instead of SELECT (a short sketch follows this list). Details -> How to iterate over CONSTRUCT output from rdflib?
  2. Use the key as the input for both account[login] and account[password]
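
A minimal sketch of change 1, using the same Sesame endpoint as in the question; with setReturnFormat(XML), SPARQLWrapper converts a CONSTRUCT result straight into an rdflib Graph, so the subjects, predicates and objects can no longer be paired up incorrectly:

# Sketch of change 1: CONSTRUCT instead of SELECT, parsed into an rdflib Graph.
from SPARQLWrapper import SPARQLWrapper, XML

sesameSparqlEndpoint = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
sparql = SPARQLWrapper(sesameSparqlEndpoint)
sparql.setQuery('CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o}')
sparql.setReturnFormat(XML)            # CONSTRUCT results come back as RDF/XML
dataGraph = sparql.query().convert()   # convert() yields an rdflib Graph here

for stmt in dataGraph:
    print stmt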

However, this may not be the most efficient way. For one thing, executing a separate INSERT for every triple is not a good approach: Dydra did not record all the statements this way (only about 30% of my triples actually made it in). Using the http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements approach suggested by Jeen instead allowed me to port all the data successfully.
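
If you do want to stay with SPARQL updates rather than the /statements upload, one way to avoid a separate INSERT per triple is to serialize the whole graph once and send a single INSERT DATA; a rough sketch, assuming dataGraph is the rdflib Graph built above and that the resulting update stays within whatever request size the Dydra endpoint accepts:

# Sketch: one INSERT DATA for the whole graph instead of one request per triple.
from SPARQLWrapper import SPARQLWrapper

ntriples = dataGraph.serialize(format='nt')      # N-Triples, one statement per line
queryStringUpload = 'INSERT DATA { %s }' % ntriples

sparql = SPARQLWrapper('http://dydra.com/username/rep_name/sparql')
sparql.setCredentials('my_key', 'my_key')
sparql.setQuery(queryStringUpload)
sparql.method = 'POST'
sparql.query()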