Question

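Guys, if we take the 'Gutenberg frequency list' as input, how can we remove all non-English characters and words using NLTK? What would the program look like?

Please help. - Karim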

Answer 1

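I have never worked with NLTK before, and there may well be a better solution. In my code snippet I simply do the following:

Read the file frequencyList.txt, which holds the mix of English and non-English words to be checked, into a variable named lines.
Then I open a new file named eng_words_only.txt. This file will contain only English words. It is empty at first; after the script has run, it will contain the English words from frequencyList.txt.
Now, for each word in frequencyList.txt, I check whether it also appears in WordNet. If it does, I write the word to eng_words_only.txt; otherwise I do nothing. Note that I am using WordNet for demonstration purposes only. It does not contain every English word!

Code: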

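# Requires the WordNet data, e.g. via nltk.download('wordnet')
from nltk.corpus import wordnet

# Read the candidate words, one per line
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Open the output file for appending
with open("eng_words_only.txt", "a") as eWords:
    for w in lines:
        # Note: w is used exactly as read, including any trailing whitespace or newline
        if not wordnet.synsets(w):  # no synsets found: treated as non-English
            print('not ' + w)
        else:  # synsets found: treated as an English word
            print('yes ' + w)
            eWords.write(w)  # write the word to the output file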

Test: I first created a file named frequencyList.txt with the following contents:

cat 
meoooow 
mouse

Then, when you run the snippet, you will see the following output in the console:

not cat

not meoooow

yes mouse

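A file named eng_words_only.txt is then created, containing only the words judged to be English. Here it contains only the word mouse. You may notice that cat is an English word, yet it is missing from eng_words_only.txt. The most likely cause in this run is that each line is passed to wordnet.synsets() exactly as read, so "cat " still carries its trailing space and newline and the lookup fails, whereas mouse, on the last line, has neither. Separately, WordNet does not cover every English word, which is why you should use a better word source than WordNet. Please note: the Python script and frequencyList.txt should be in the same directory. You can also use any other file you want to check instead of frequencyList.txt; in that case, do not forget to change the file name in the snippet.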
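
Lines read with readlines() keep their trailing newline (and any trailing spaces), which can make a dictionary lookup miss a word such as cat. Below is a minimal sketch that strips each line before checking it. The helper names clean_lines, keep_english and in_wordnet are hypothetical, and a small stand-in set replaces WordNet so the example runs without the NLTK data; pass in_wordnet as the predicate to use WordNet instead.

```python
def clean_lines(lines):
    """Strip surrounding whitespace from each line and drop blank lines."""
    return [s for s in (line.strip() for line in lines) if s]


def keep_english(lines, is_english):
    """Return the cleaned words for which is_english(word) is true."""
    return [w for w in clean_lines(lines) if is_english(w)]


def in_wordnet(word):
    """Predicate backed by WordNet (needs NLTK and its WordNet data)."""
    # Imported lazily so the helpers above work without NLTK installed.
    from nltk.corpus import wordnet
    return bool(wordnet.synsets(word))


# Stand-in word set for illustration only (an assumption, not real data).
known = {"cat", "mouse"}
sample = ["cat \n", "meoooow \n", "mouse"]

result = keep_english(sample, lambda w: w.lower() in known)
print(result)  # ['cat', 'mouse']
```

Taking the predicate as an argument keeps the stripping logic separate from the choice of word source, so the same loop works with WordNet or with a plain word list.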

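Second solution: although you did not ask for it, there is another way to do this English-word test.

Here is the code. wordlist-eng.txt is a file containing English words. You must keep wordlist-eng.txt, frequencyList.txt and the Python script in the same directory.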

# Load the reference word list into a set for fast lookups
with open("wordlist-eng.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

# Read the candidate words, one per line
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Append every line whose word appears in the reference list
with open("eng_words_only.txt", "a") as eWords:
    for w in lines:
        if w.strip().lower() in english_words:
            eWords.write(w)

After the script has run, eng_words_only.txt will contain all the English words from frequencyList.txt.
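
The question also asks about removing non-English characters, which the word-list check alone does not do. Here is a minimal sketch, assuming "English characters" means the ASCII letters a-z and A-Z; the function name strip_non_english_chars is hypothetical. It only removes characters, so the result should still be checked against a word list afterwards.

```python
import re

# Assumption: "English characters" means the ASCII letters a-z and A-Z.
_NON_ASCII_LETTERS = re.compile(r"[^A-Za-z]+")


def strip_non_english_chars(token):
    """Remove every character that is not an ASCII letter."""
    return _NON_ASCII_LETTERS.sub("", token)


# "caf\u00e9" is "cafe" with an accented final letter.
tokens = ["hello!", "caf\u00e9", "1234", "mouse"]
cleaned = [strip_non_english_chars(t) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that became empty
print(cleaned)  # ['hello', 'caf', 'mouse']
```

Note that dropping the accented letter turns the second token into "caf", which is not a word; a lookup against the word list would then reject it.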

I hope this helps.
