Question

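I am using both NLTK and scikit-learn to do some text processing. However, my list of documents contains some that are not in English. For example, the following could be true: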

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ]

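For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that lets me recognize whether a string is in English. Is this functionality not offered by either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both deal with individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be the best fit for this.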

Answer 1

There is a library called langdetect. It is a port of Google's language-detection library:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
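
A minimal sketch of how this could be applied to the list in the question, assuming langdetect is installed (`pip install langdetect`). The helper name `filter_english` and its `detect` parameter are made up for illustration; `detect` and `DetectorFactory.seed` come from langdetect itself. Detection on very short strings can be unreliable, so treat the result as a heuristic.

```python
def filter_english(docs, detect=None):
    """Return only the documents whose detected language code is 'en'.

    `detect` is any callable mapping a string to a language code. It
    defaults to langdetect.detect (this wrapper is a hypothetical helper,
    not part of langdetect).
    """
    if detect is None:
        # Imported lazily so the helper also works with another detector.
        import langdetect
        langdetect.DetectorFactory.seed = 0  # results vary between runs unless seeded
        detect = langdetect.detect

    english = []
    for doc in docs:
        try:
            code = detect(doc)
        except Exception:
            # langdetect raises LangDetectException for text with no
            # usable features (empty strings, digits only, and so on).
            continue
        if code == "en":
            english.append(doc)
    return english


# Usage (requires langdetect):
# docs = ["this is some text written in English",
#         "this is some more text written in English",
#         "Ce n'est pas en anglais"]
# print(filter_english(docs))
```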

Answer 2

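You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

TL;DR:

CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.

You can install lidtk and classify languages: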

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra

Answer 3

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc

dictionary.check("Hello")  # returns True
dictionary.check("Helo")  # returns False

This example is taken directly from their website

Answer 4

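If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data: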

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

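If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you have no trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.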
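
A rough sketch of the idea, not the linked recipe: build a trigram count profile for the sample and for each reference text, then compare them with cosine similarity. The two reference strings below are toy placeholders I made up, and the function names are illustrative; a real profile would be built from a large corpus per language, and the threshold would come from the kind of testing described above.

```python
import math
from collections import Counter

# Toy reference texts (assumptions for illustration only; use a large
# corpus per language in practice).
ENGLISH_REF = ("the quick brown fox jumps over the lazy dog "
               "and this is some text written in english")
FRENCH_REF = ("ce n'est pas en anglais et le renard brun saute "
              "par dessus le chien paresseux")


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two trigram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def language_scores(text):
    """Similarity of `text` to each toy reference profile."""
    sample = trigram_profile(text)
    return {
        "en": cosine_similarity(sample, trigram_profile(ENGLISH_REF)),
        "fr": cosine_similarity(sample, trigram_profile(FRENCH_REF)),
    }


def is_english(text, threshold):
    """Yes/no test: English must score highest and reach `threshold`."""
    scores = language_scores(text)
    return scores["en"] == max(scores.values()) and scores["en"] >= threshold
```

Calling `language_scores` on a handful of known English and non-English sentences from your own corpus is one way to see what range of scores to expect before fixing `threshold`.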

Answer 5

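This is what I used some time ago. With the thresholds set in the code below, it accepts texts of at least 3 words that contain no more than 4 unrecognized words. Of course, you can play with the settings, but for my use case (website scraping) they worked pretty well.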

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
  d = SpellChecker("en_US")
  d.set_text(quote)
  errors = [err.word for err in d]
  return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

Answer 6

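A pretrained fastText model worked best for my similar needs

I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer HERE.

After experimenting to find what worked best for my needs (making sure that 60,000+ text files were in English), I found that fasttext was an excellent tool.

With a little work, I had a tool that ran very fast over many files. Below is the code with comments. I believe that you and others will be able to modify it for your more specific needs.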

import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        #    that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below, just clean up the lines in fla
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Everytime that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # calculate a relative confidence of the max confidence to all
        #    confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is english or not.
        return max_key == 'en'

Below is the instantiation and use of the above class for my needs.

file_list = []  # fill in with your own list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
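
For the list of strings in the question, rather than whole files, the same model could be applied per string. This is only a sketch under assumptions: `keep_english_lines` and the 0.5 cutoff are illustrative choices, not tuned values, and `model` is assumed to be the object returned by `fasttext.load_model` for the `lid.176.ftz` file.

```python
def keep_english_lines(lines, model, min_confidence=0.5):
    """Keep the strings whose top predicted label is English.

    `model` is expected to behave like a loaded fastText language
    identification model: model.predict(text) returns a pair
    (labels, probabilities) with the best guess first.
    """
    kept = []
    for line in lines:
        # fastText's predict rejects newline characters, so collapse
        # all whitespace into single spaces first.
        text = " ".join(line.split())
        if not text:
            continue  # nothing to classify
        labels, probabilities = model.predict(text)
        if labels[0] == "__label__en" and probabilities[0] >= min_confidence:
            kept.append(line)
    return kept


# Usage (requires fasttext and a downloaded model file):
# import fasttext
# model = fasttext.load_model('location of your lid.176.ftz file from fasttext')
# print(keep_english_lines(["this is some text written in English",
#                           "Ce n'est pas en anglais"], model))
```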

Determine if text is in English?
