Automatic region classification

Time: 2013-04-04 03:46:04

Tags: python django design-patterns

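I'm scraping stories from several news agencies, and I want to create a filter that automatically categorizes the stories by region.

My database currently has a table containing all the countries of the world, plus a related table containing their cities.

I'm unsure how to approach this. The only solution I can think of is to split each story into words and compare every word against every country and city, which I believe would consume a lot of resources.

Please advise.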

1 answer:

Answer 0: (score: 0)

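I'm not convinced that scanning a story until you find a city would be all that inefficient for world news stories. I just looked at the BBC World News homepage, and most stories mention a city in the first paragraph. I'd probably make a statistician cry by drawing any conclusion from that, but I have a hunch it holds for most world news stories.

Unfortunately I'm not familiar with Python, but here is a C# program that does this.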

using System;
using System.Collections.Generic;

class Program
{
    // You could keep two HashSets, one for cities and one for countries, and search for both.
    // Since a country can always be derived from its city, storing only cities keeps the hash table smaller.
    // The downside: an article that mentions only a country and no city cannot be categorized.
    static HashSet<String> cities = new HashSet<String>();
    static Dictionary<String, String> cityToCountry = new Dictionary<String, String>();
    const int MAX_WORDS_TO_READ = 200;

    static void Main(string[] args)
    {
        addCities();
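        String sentence = "Former South African President Nelson Mandela is discharged from hospital in Johannesburg after 10 days of treatment for pneumonia.";

        String city = findCity(sentence);
        if (city == null)
        {
            // findCity returns null when no known city is found; indexing the dictionary with null would throw.
            Console.WriteLine("No known city found.");
        }
        else
        {
            Console.WriteLine(city + ", " + cityToCountry[city]);
        }
        Console.ReadLine();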
    }

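    // Returns the first single-word city name found, or null if none is found.
    // Multi-word names such as "New York" are not matched by this simple scan.
    static String findCity(String sentence)
    {
        int wordStart = 0;
        int wordsSeen = 0;

        // Iterate one position past the end so the final word is checked too.
        for (int i = 0; i <= sentence.Length; i++)
        {
            // Any non-letter (space, digit, punctuation) ends the current word.
            bool atBoundary = i == sentence.Length || !Char.IsLetter(sentence[i]);
            if (!atBoundary)
            {
                continue;
            }

            if (i > wordStart)
            {
                String word = sentence.Substring(wordStart, i - wordStart);
                if (isCity(word))
                {
                    return word;
                }

                wordsSeen++;

                // Assume that if no city appears in the first n words, the article is not about a specific city
                // (or not about one in your database). This avoids scanning massive articles that have no city.
                if (wordsSeen >= MAX_WORDS_TO_READ)
                {
                    return null;
                }
            }

            wordStart = i + 1;
        }

        return null;
    }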

    static bool isCity(String city)
    {
        return cities.Contains(city);
    }


    static void addCities()
    {
        //Naturally you would parse in a list of cities from somewhere
        cities.Add("Berlin");
        cities.Add("London");
        cities.Add("Johannesburg");

        cityToCountry.Add("Berlin", "Germany");
        cityToCountry.Add("London", "England");
        cityToCountry.Add("Johannesburg", "South Africa");
    }




}
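
Since the question is tagged Python, here is a rough sketch of the same idea in that language. It is not a tested port of the C# program: the sample data, the `find_city` name and the word-matching regular expression are all assumptions made for illustration, and the "New York" entry is added only to show how multi-word names could be matched by joining adjacent words.

```python
import re
from itertools import islice

# Hypothetical sample data; in practice this would be loaded from the cities table.
CITY_TO_COUNTRY = {
    "Berlin": "Germany",
    "London": "England",
    "Johannesburg": "South Africa",
    "New York": "United States",
}

MAX_WORDS_TO_READ = 200
# Longest city name in the lookup, measured in words.
MAX_NAME_WORDS = max(len(name.split()) for name in CITY_TO_COUNTRY)
# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def find_city(text):
    """Return the first known city among the first MAX_WORDS_TO_READ words, or None."""
    # islice stops the scan early, so very long articles are not tokenized in full.
    words = [m.group() for m in islice(WORD_RE.finditer(text), MAX_WORDS_TO_READ)]
    for start in range(len(words)):
        # Try the longest candidate first so a multi-word name wins over a shorter one.
        for length in range(MAX_NAME_WORDS, 0, -1):
            candidate = " ".join(words[start:start + length])
            if candidate in CITY_TO_COUNTRY:
                return candidate
    return None


sentence = ("Former South African President Nelson Mandela is discharged from "
            "hospital in Johannesburg after 10 days of treatment for pneumonia.")
city = find_city(sentence)
print(city, CITY_TO_COUNTRY.get(city))
```

Each candidate costs one dictionary lookup, so the work grows with the number of words scanned rather than with the number of cities in the table. Matching here is case-sensitive and exact, which is an assumption you may want to relax.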

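On the other hand, if you look at the England section of BBC News, you end up with stories like this, which don't even mention the country in the article itself, and I'd be very surprised if your list of cities included Okehampton. To work around this you could use the context provided by the section of the news site the story appears in, but the bottom line is that the efficiency of this approach depends on the type of news stories you're scraping.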
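
The section fallback could be sketched along these lines in Python. Everything here is hypothetical: the `classify` function, the `SECTION_TO_COUNTRY` mapping and the section names are illustrative, and a real scraper would have to record which section each story came from.

```python
import re

# Hypothetical sample data; real values would come from the database tables.
CITY_TO_COUNTRY = {"London": "England", "Johannesburg": "South Africa"}
# Hypothetical mapping from a site section to a default country.
SECTION_TO_COUNTRY = {"england": "England"}


def classify(text, section=None):
    """Return a (city, country) pair; either value may be None."""
    for word in re.findall(r"[^\W\d_]+", text):
        if word in CITY_TO_COUNTRY:
            return word, CITY_TO_COUNTRY[word]
    # No known city in the text: fall back to the section the story was scraped from.
    return None, SECTION_TO_COUNTRY.get(section)


print(classify("Police appeal after a burglary in Okehampton.", "england"))
print(classify("Mandela leaves hospital in Johannesburg.", "world"))
```

A story from an unmapped section with no known city comes back as `(None, None)`, so such stories can be set aside for manual review instead of being guessed at.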