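I am trying to build a topic hierarchy by following the two DBpedia properties mentioned below.
My goal is to identify the topics of a given term. For example, given the term 'support vector machine', I want to identify topics such as classification algorithms, machine learning, etc.
However, I am sometimes unsure how to build the topic hierarchy, because I get more than five URIs for subject and many URIs for the broader property. Is there a way to measure strength, or something similar, so that I can discard the extra URIs returned by DBpedia and assign only the most probable one?
There seem to be two questions here.
My current code is as follows.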
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse
## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
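# collect the URI of every entity that Spotlight annotated
for res in resources:
    all_urls.append(res['@URI'])
# query the categories (dct:subject) and broader concepts (skos:broader) of each URI
for url in all_urls:
    sparql.setQuery("""
        SELECT * WHERE {<"""
        +url+
        """>skos:broader|dct:subject ?resource
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])
I am happy to provide more examples if needed.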
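Answer 0 (score: 2)
You seem to be trying to retrieve the Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I suggest performing a single request that collects the DBpedia Spotlight results into a VALUES block, for example like this: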
values = '(<{0}>)'.format('>) (<'.join(all_urls))
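To show how that VALUES string fits into a complete query, here is a minimal sketch. The helper name build_values_query and the two sample URIs are my own illustrative assumptions, and, like the code in the question, it relies on the DBpedia endpoint predefining the dct: prefix.

```python
def build_values_query(uris):
    """Build one SPARQL query that covers every URI through a VALUES block.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Each URI becomes a one-element row of the VALUES block: (<uri>)
    values = ' '.join('(<{0}>)'.format(u) for u in uris)
    return (
        'SELECT ?s ?resource WHERE {\n'
        '  VALUES (?s) { ' + values + ' }\n'
        '  ?s dct:subject ?resource .\n'
        '}'
    )


# Sample URIs, chosen only for illustration
query = build_values_query([
    'http://dbpedia.org/resource/Berlin',
    'http://dbpedia.org/resource/Prussia',
])
print(query)
```

The resulting string can be passed to sparql.setQuery() once, instead of issuing one request per URI.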
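Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.
These two suggestions are slightly incompatible. Virtuoso is very inefficient when a query combines multiple starting points (i.e. VALUES) with arbitrary-length paths (i.e. the * and + operators).
Below I use the dct:subject/skos:broader property path, i.e. I retrieve the 'next-level' categories.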
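If you want to climb a fixed number of levels rather than exactly one, the path can be assembled by repeating the skos:broader step. This is only a sketch: category_path is a hypothetical helper name, and a fixed-length path is used on purpose to avoid the arbitrary-length operators mentioned above.

```python
def category_path(levels_up=1):
    """Return a SPARQL 1.1 property path that climbs `levels_up` category levels.

    Hypothetical helper for illustration; not part of the original code.
    """
    if levels_up < 0:
        raise ValueError('levels_up must be non-negative')
    # dct:subject links a resource to its direct categories;
    # each skos:broader step then moves one level up the category tree.
    return 'dct:subject' + '/skos:broader' * levels_up


print(category_path(0))
print(category_path(1))
print(category_path(2))
```

With levels_up=1 this yields exactly the path used in the queries below.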
Approach 1
The first way is to order the resources by their general popularity, e.g. by their PageRank:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
       SELECT DISTINCT ?resource ?rank
       FROM <http://dbpedia.org>
       FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
       WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject/skos:broader ?resource .
           ?resource vrank:hasRank/vrank:rankValue ?rank.
       } ORDER BY DESC(?rank)
         LIMIT 10
    """)
The results are:
dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
Approach 2
The second way is to count how often each category occurs for the given text:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """SELECT ?resource count(?resource) AS ?count WHERE {
           VALUES (?s) {""" + values +
    """    }
           ?s dct:subject ?resource
       } GROUP BY ?resource
         # https://github.com/openlink/virtuoso-opensource/issues/254
         HAVING (count(?resource) > 1)
         ORDER BY DESC(count(?resource))
         LIMIT 10
    """)
The results are:
dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America
With dct:subject instead of dct:subject/skos:broader, the results are better:
dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
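The same frequency count can also be done client-side, which sidesteps the HAVING workaround in the query. The sketch below assumes you ran a plain SELECT ?s ?resource query (one row per entity/category pair) and have the bindings list from the SPARQL JSON result; count_categories and the sample rows are made up for illustration.

```python
from collections import Counter


def count_categories(bindings, min_count=2, top=10):
    """Count how often each category URI occurs in SPARQL JSON bindings.

    Hypothetical helper; `bindings` is results["results"]["bindings"]
    from a query that returns one ?resource per entity/category pair.
    """
    counts = Counter(b['resource']['value'] for b in bindings)
    # min_count=2 mirrors HAVING (count(?resource) > 1) in the query
    frequent = [(uri, n) for uri, n in counts.most_common() if n >= min_count]
    return frequent[:top]


# Made-up sample rows in the SPARQL JSON results shape
sample = [
    {'resource': {'type': 'uri', 'value': 'dbc:Modern_history_of_Germany'}},
    {'resource': {'type': 'uri', 'value': 'dbc:Former_republics'}},
    {'resource': {'type': 'uri', 'value': 'dbc:Modern_history_of_Germany'}},
]
top_categories = count_categories(sample)
print(top_categories)
```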
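Conclusion
The results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Better results might be achievable by combining approaches 1 and 2. In any case, experiments with a large corpus are needed.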
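One possible way to combine the two approaches is a weighted sum of the normalized frequency and the normalized PageRank. I have not evaluated this; the function name combine_scores, the 0.5 weight and the sample numbers are all assumptions made purely for illustration.

```python
def combine_scores(freq, rank, weight=0.5):
    """Blend category frequency (approach 2) with PageRank (approach 1).

    Hypothetical scoring scheme, not an evaluated method.
    freq: dict mapping category -> occurrence count
    rank: dict mapping category -> PageRank value
    """
    # Normalize both signals to [0, 1]; guard against empty or all-zero input
    max_f = max(freq.values(), default=0) or 1
    max_r = max(rank.values(), default=0) or 1
    scores = {}
    for cat in set(freq) | set(rank):
        f = freq.get(cat, 0) / max_f
        r = rank.get(cat, 0.0) / max_r
        scores[cat] = weight * f + (1 - weight) * r
    # Highest combined score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up numbers, only to show the call
ranking = combine_scores(
    freq={'dbc:A': 4, 'dbc:B': 2},
    rank={'dbc:A': 1.0, 'dbc:B': 10.0, 'dbc:C': 5.0},
)
print(ranking)
```

The weight would have to be tuned on a larger corpus, as noted above.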