My goal is to create a HashMap keyed by String, where each entry's value is a HashSet of Strings.
Output
This is what the output looks like right now:
Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]
As I picture it, it should look like this instead:
[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]
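The aim is to store a particular name from Wikidata and then collect the Q values of everything that name disambiguates to. For example:
This is the page for "Bush".
I want Bush to be the key. Then, for every different way that Bush can lead to a terminal Wikidata page, I want to store the corresponding "Q value", i.e. the unique alphanumeric identifier.
What I'm actually doing is scraping the different names from the Wikipedia disambiguation page and then looking up the unique alphanumeric identifier associated with each name in Wikidata.
For example, for Bush we have: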
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
So the Q values are:
George H. W. Bush(Q23505)
George W. Bush(Q207)
Jeb Bush(Q221997)
Bush family(Q2743830)
Bush(Q1484464)
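My idea is that the data structure should look like this:
Key: Bush
Entry set: Q23505, Q207, Q221997, Q2743830, Q1484464
But my current code doesn't do that.
It creates a separate entry for each name and Q value, i.e.
Key: Jeb Bush
Entry set: Q221997
Key: George W. Bush
Entry set: Q207
and so on.
The full code in all its glory can be seen on my github page, but I'll also summarize it below.
This is what I use to add values to my data structure: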
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
This is how I fetch the content:
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
    // if we can determine it's a disambig page we need to send it off to get all
    // the possible senses in which it can be used.
    Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
    Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
    if (disambig_indicator.matches())
    {
        //off to get the different usages
        Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
    }
    else
    {
        //get the Q value off the page by matching
        Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
        if ( match_Q_component.matches() )
        {
            String Q = match_Q_component.group(1);
            // 'Q' should be appended to an array, since each entity can hold multiple
            // Q values on that basis of disambig
            put_to_hash( variable_entity, Q );
        }
    }
}
This is how I handle disambiguation pages:
public static void all_possibilities( String variable_entity ) throws Exception
{
    System.out.println("this is a disambig page");
    //if it's a disambig page we know we can go right to the wikipedia
    //get it's normal wiki disambig page
    Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
    //this can handle the less structured ones.
    Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
    for (Element linq : linx)
    {
        System.out.println(linq.text());
        String linq_nospace = linq.text().replace(' ', '+');
        Wikidata_Q_Reader.getQ( linq_nospace );
    }
}
I thought maybe I could pass the key value along, but I really don't know. I'm a bit stuck. Maybe someone can see how I could implement this.
Answer 0 (score: 2)
I'm not clear from your question what isn't working, or if you're seeing actual errors. But, while your basic data structure idea (HashMap of String to Set<String>) is sound, there's a bug in the "add" function.
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the old value for that key, so it'll be null.) So you're going to lose the first Q-value seen for every term.
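To make the effect concrete, here is a small self-contained sketch that reproduces it. The class name FirstValueDropDemo and the declaration of q_valMap are assumptions for illustration (the question doesn't show how the map is declared); the method body is the one from the question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class FirstValueDropDemo {
    // Assumed declaration; the question does not show how q_valMap is defined.
    public static final Map<String, HashSet<String>> q_valMap = new HashMap<>();

    // Same logic as the method in the question.
    public static HashSet<String> put_to_hash(String key, String value) {
        if (!q_valMap.containsKey(key)) {
            // Map.put returns the PREVIOUS mapping (null for a new key),
            // and 'value' is never added to the freshly created set.
            return q_valMap.put(key, new HashSet<String>());
        }
        HashSet<String> list = q_valMap.get(key);
        list.add(value);
        return q_valMap.put(key, list);
    }

    public static void main(String[] args) {
        System.out.println(put_to_hash("Bush", "Q23505")); // prints "null"
        put_to_hash("Bush", "Q207");
        System.out.println(q_valMap); // prints "{Bush=[Q207]}": Q23505 was dropped
    }
}
```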
For multi-layered data structures like this, I usually special-case just the vivification of the intermediate structure, and then do the adding and return in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key)) {
        q_valMap.put(key, new HashSet<String>());
    }
    HashSet<String> valSet = q_valMap.get(key);
    valSet.add(value);
    return valSet;
}
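On Java 8 or later, the same create-if-missing-then-add pattern can probably be written more compactly with Map.computeIfAbsent. This is only a sketch: the class name QValueIndex, the camelCase names, and the map declaration are assumptions, not code from the question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QValueIndex {
    // Assumed declaration of the map; names here are illustrative.
    public static final Map<String, Set<String>> qValMap = new HashMap<>();

    // Creates the set on first sight of the key, then always adds the value.
    public static Set<String> putToHash(String key, String value) {
        Set<String> valSet = qValMap.computeIfAbsent(key, k -> new HashSet<>());
        valSet.add(value);
        return valSet;
    }

    public static void main(String[] args) {
        putToHash("Bush", "Q23505");
        putToHash("Bush", "Q207");
        putToHash("Bush", "Q221997");
        putToHash("Bush", "Q207"); // duplicate, ignored by the set
        System.out.println(qValMap.get("Bush").size()); // prints 3
    }
}
```

Note that this only groups values if every call uses the same key (here "Bush"); calling it with "Jeb Bush" or "George W. Bush" as the key still produces one entry per name.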
Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.
Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.
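If you'd rather stay with the standard library, one possible way to soften the live-reference and threading caveats is to keep the map private, hand callers a read-only view, and use the concurrent collections. The sketch below assumes Java 8+; SafeQValueIndex, add, and get are hypothetical names, and whether this level of thread safety is enough depends on how your fetchers actually run.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SafeQValueIndex {
    // The map is private so callers cannot reach the mutable sets directly.
    private static final Map<String, Set<String>> qValMap = new ConcurrentHashMap<>();

    // Atomically creates a concurrent set for a new key, then adds the value.
    public static void add(String key, String value) {
        qValMap.computeIfAbsent(key, k -> ConcurrentHashMap.<String>newKeySet()).add(value);
    }

    // Returns a read-only view; attempts to modify it throw UnsupportedOperationException.
    public static Set<String> get(String key) {
        Set<String> valSet = qValMap.get(key);
        if (valSet == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(valSet);
    }

    public static void main(String[] args) {
        add("Bush", "Q23505");
        add("Bush", "Q207");
        System.out.println(get("Bush").size()); // prints 2
    }
}
```

The returned view still reflects later additions made through add; it just can't be changed by the caller.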