使用与HashSet对应的固定Key创建HashMap。出发点

时间:2015-04-23 04:58:51

标签: java hashmap hashset wikipedia

我的目标是创建一个以String作为键的hashmap,并将条目值作为字符串的HashSet。

输出

这就是输出现在的样子:

Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]

根据我的想法,它应该是这样的:

[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]

目的是在Wikidata中存储特定名称,然后将与其相关的所有Q值消除歧义,例如:

This是“布什”的页面。

我希望布什成为关键,然后对于所有不同的出发点,Bush可以与维基数据的终端页面关联的所有不同方式,我想存储相应的“ Q值“,或唯一的字母数字标识符。

我实际上在做的是尝试从维基百科消歧中删除不同的名称,值,然后在wikidata中查找与该值相关联的唯一字母数字标识符。

例如,Bush我们有:

George H. W. Bush 
George W. Bush
Jeb Bush
Bush family
Bush (surname) 

因此,Q值为:

George H. W. Bush(Q23505)

George W. Bush(Q207)

Jeb Bush(Q221997)

Bush family(Q2743830)

Bush(Q1484464)

我的想法是数据结构应按以下方式解释

键: Bush 条目设置: Q23505, Q207, Q221997, Q2743830, Q1484464

但我现在的代码并没有这样做。

它为每个名称和Q值创建一个单独的条目。即。

键: Jeb Bush 条目设置: Q221997

键: George W. Bush 条目设置: Q207

等等。

my github page可以看到所有荣耀的完整代码,但我也会在下面对其进行总结。

这就是我用来为我的数据结构添加值的原因:

// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value) 
{
    if (!q_valMap.containsKey(key)) 
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}

这是我获取内容的方式:

    while ((line_by_line = wiki_data_pagecontent.readLine()) != null) 
    {
        // if we can determine it's a disambig page we need to send it off to get all 
        // the possible senses in which it can be used.
        Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
        Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
        if (disambig_indicator.matches()) 
        {
            //off to get the different usages
            Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
        }
        else
        {
            //get the Q value off the page by matching
            Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
                    "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
                    "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");

            Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
            if ( match_Q_component.matches() ) 
            {
                String Q = match_Q_component.group(1);

                // 'Q' should be appended to an array, since each entity can hold multiple
                // Q values on that basis of disambig
                put_to_hash( variable_entity, Q );
            }
        }

    }

这就是我处理消歧页面的方式:

public static void all_possibilities( String variable_entity ) throws Exception
{
    System.out.println("this is a disambig page");
    //if it's a disambig page we know we can go right to the wikipedia


    //get it's normal wiki disambig page
    Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();



        //this can handle the less structured ones. 
        Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );

        for (Element linq : linx) 
        {
            System.out.println(linq.text());
            String linq_nospace = linq.text().replace(' ', '+');
            Wikidata_Q_Reader.getQ( linq_nospace );

        }

}

我想也许我可以传递Key值,但我真的不知道。我有点卡住了。也许有人可以看到我如何实现这个功能。

1 个答案:

答案 0 :(得分:2)

I'm not clear from your question what isn't working, or if you're seeing actual errors. But, while your basic data structure idea (HashMap of String to Set<String>) is sound, there's a bug in the "add" function.

public static HashSet<String> put_to_hash(String key, String value) 
{
    if (!q_valMap.containsKey(key)) 
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}

In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the old value for that key, so it'll be null.) So you're going to be losing one of the Q-values for every term.

For multi-layered data structures like this, I usually special-case just the vivification of the intermediate structure, and then do the adding and return in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)

public static HashSet<String> put_to_hash(String key, String value) 
{
    if (!q_valMap.containsKey(key)) {
        q_valMap.put(key, new HashSet<String>());
    } 
    HashSet<String> valSet = q_valMap.get(key);
    valSet.add(value);
    return valSet;
}

Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.

Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.