guess_language module returns UNKNOWN

Asked: 2015-02-09 14:31:31

Tags: python

I installed the following (I am on Windows 7, using a virtualenv with Python 2.7.5):

pip install pyenchant
pip install 3to2
pip install https://bitbucket.org/spirit/guess_language/downloads/guess_language-spirit-0.5.tar.bz2

Then I ran:

>>> from guess_language import guess_language
>>> guess_language("Hello World")
u'UNKNOWN'

Why do I get u'UNKNOWN'?

This is the project site

1 answer:

Answer 0 (score: 2)

I suggest using nltk instead; this is easier to do with nltk.

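import nltk

# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once.
# Map each language name to its set of stopwords.
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang))
                  for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    # Tokenize the lowercased text and keep the unique tokens.
    words = set(nltk.wordpunct_tokenize(text.lower()))
    # Return the language whose stopword set overlaps the text the most.
    return max(((lang, len(words & stopwords))
                for lang, stopwords in STOPWORDS_DICT.items()),
               key=lambda x: x[1])[0]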

Now let's see the code in action.

In [28]: get_language('hello world')
Out[28]: 'swedish'

In [30]: get_language('stackoverflow is a nice website')
Out[30]: 'english'

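The problem is that it gives wrong results when the sample text is very short: a text such as 'hello world' contains no stopwords at all, so every language scores 0 and max() simply returns whichever language it happens to see first.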

The code is taken from this website.
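One possible way to soften the short-text problem is to refuse to guess when no stopword matches at all, instead of returning an arbitrary language. The following is only a minimal sketch under stated assumptions: the small hard-coded stopword sets, the function name get_language_or_unknown, and the min_hits parameter are all hypothetical and used for illustration; in practice the sets would be built from nltk.corpus.stopwords as in the answer.

```python
import re

# Hypothetical, deliberately tiny stopword sets so the sketch runs without
# any corpus download. Real code would build this from nltk.corpus.stopwords.
STOPWORDS_DICT = {
    "english": {"is", "a", "the", "and", "of"},
    "swedish": {"och", "det", "att", "en", "som"},
    "german": {"und", "der", "die", "das", "ist"},
}

def get_language_or_unknown(text, min_hits=1):
    """Return the best-matching language, or 'UNKNOWN' if evidence is too weak."""
    # Simple regex tokenization (an assumption; the answer uses
    # nltk.wordpunct_tokenize instead).
    words = set(re.findall(r"\w+", text.lower()))
    # Count how many distinct stopwords of each language occur in the text.
    scores = {lang: len(words & stopwords)
              for lang, stopwords in STOPWORDS_DICT.items()}
    best_lang, best_score = max(scores.items(), key=lambda item: item[1])
    # With fewer than min_hits matches the "winner" is arbitrary, so decline.
    if best_score < min_hits:
        return "UNKNOWN"
    return best_lang

print(get_language_or_unknown("hello world"))
print(get_language_or_unknown("stackoverflow is a nice website"))
```

With these toy sets, 'hello world' matches no stopword and is reported as UNKNOWN, while the longer English sentence matches "is" and "a". Raising min_hits makes the guess more conservative at the cost of rejecting more short inputs.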