Question

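I am doing a data-cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python with a toolkit such as nltk.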

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my"

Does anyone know how this could be done? Any help would be much appreciated.

Answer 1

You can use the words corpus from NLTK:

import nltk
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens that are in the word list, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'

Unfortunately, Io happens to be an English word. In general, it can be hard to decide whether a word is English or not.
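One way to deal with ambiguous entries such as Io is to add an explicit exclusion list on top of the dictionary lookup. The following is only a minimal sketch of that idea: the function name keep_known_words, the exclude parameter, and the tiny vocabulary set are assumptions made for illustration (the set stands in for set(nltk.corpus.words.words()) so the example runs without NLTK). Unlike the snippet above, it drops punctuation instead of keeping it.

```python
import re

def keep_known_words(text, vocabulary, exclude=()):
    """Keep tokens found in `vocabulary`, minus anything listed in `exclude`."""
    excluded = {w.lower() for w in exclude}
    # Runs of letters only (Unicode-aware); digits and punctuation are dropped
    tokens = re.findall(r"[^\W\d_]+", text)
    kept = [t for t in tokens
            if t.lower() in vocabulary and t.lower() not in excluded]
    return " ".join(kept)

# Hypothetical mini-vocabulary standing in for set(nltk.corpus.words.words())
vocabulary = {"io", "to", "the", "beach", "with", "my"}
sent = "Io andiamo to the beach with my amico."

print(keep_known_words(sent, vocabulary))                  # Io to the beach with my
print(keep_known_words(sent, vocabulary, exclude={"io"}))  # to the beach with my
```

The exclusion list has to be maintained by hand, so this only helps when the set of problem words is small and known in advance.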

Answer 2

There is a nice Python library called Enchant that can check whether a word is English.

From their homepage:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]

So you could do this:

string = "Io andiamo to the beach with my amico."
english_words = []
# Keep only the whitespace-separated words that the en_US dictionary accepts
for word in string.split():
    if d.check(word):
        english_words.append(word)
print(" ".join(english_words))

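Note: short words are hard to assign to a language, because many short words exist in several languages, so the result of the code above is: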

Io to the beach with my

whereas you would want Io to be excluded.
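The loop above checks each word with its punctuation still attached (for example "amico."). A minimal sketch of stripping punctuation before the check is shown below. The function filter_with_checker and the stand-in toy_check are assumptions made for illustration, not part of Enchant; with Enchant installed you could pass d.check as the checker instead. Note that a real dictionary would still accept Io, so this does not solve the short-word problem.

```python
import string

def filter_with_checker(text, is_english):
    """Keep whitespace-separated words for which is_english(word) is true."""
    kept = []
    for raw in text.split():
        word = raw.strip(string.punctuation)  # "amico." -> "amico"
        if word and is_english(word):         # skip tokens that were only punctuation
            kept.append(word)
    return " ".join(kept)

# Stand-in checker for illustration only; it mimics a dictionary lookup
known = {"to", "the", "beach", "with", "my"}

def toy_check(word):
    return word.lower() in known

print(filter_with_checker("Io andiamo to the beach, with my amico.", toy_check))
# to the beach with my
```

Taking the checker as a parameter keeps the filtering logic independent of which dictionary backend is used.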

Answer 3

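On Mac OS X, this code can still raise an exception, so make sure you download the words corpus manually. Importing the nltk library does not download the words corpus automatically (at least it did not on Mac OS), so you have to download it yourself or you will get an exception.

import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())

Now you can run the same code as in the previous answer:

sent = "Io andiamo to the beach with my amico."
sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())

The NLTK documentation does not say this, but I came across an issue on GitHub, solved the problem this way, and it really works. If you do not pass the 'words' argument there, OS X can log you out, and it can happen again and again.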
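To avoid downloading the corpus on every run, the download can be made conditional. This is a minimal sketch, assuming NLTK is installed and that the corpus is stored under the resource path 'corpora/words'; the function name load_english_words is made up for illustration. The import sits inside the function so that defining it does not itself require NLTK.

```python
def load_english_words():
    """Return the NLTK words corpus as a set, downloading it only if missing."""
    import nltk  # imported lazily; NLTK must be installed to call this function

    try:
        # nltk.data.find raises LookupError when the resource is not available locally
        nltk.data.find("corpora/words")
    except LookupError:
        nltk.download("words")  # name the corpus explicitly, as in the answer above
    return set(nltk.corpus.words.words())
```

Calling words = load_english_words() then gives the same set that the snippets above build by hand.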

Removing non-English words from text using Python
