Find difficult English words in a text to build a vocabulary, using Python or JavaScript

Time: 2019-02-15 13:17:03

Tags: javascript python text

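I want to extract difficult words from English text available online, for example from Gutenberg, to build a vocabulary using Python or JavaScript. I don't want simple words, only distinctive vocabulary such as regal, apocryphal, etc.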

How can I make sure that when I split the text, I get only the distinctive vocabulary and not the simple words?

3 Answers:

Answer 0: (score: 1)

As @Hoog suggested, here is the pseudocode:

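# simple_words: whatever list of easy words you choose (left elided here)
simple_words = [...]
# Keep every vocabulary word that is not in the simple list
difficult_words = [word for word in english_vocabulary if word not in simple_words]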
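The pseudocode above can be turned into a small runnable sketch. The function name `find_difficult_words` and both sample lists are made up for illustration; a real simple-word list would have to come from some external source. The sketch also lowercases words and drops duplicates, which the pseudocode does not do.

```python
def find_difficult_words(vocabulary, simple_words):
    """Return the words in `vocabulary` that are not in `simple_words`, in first-seen order."""
    simple = {w.lower() for w in simple_words}  # a set gives fast membership tests
    seen = set()
    result = []
    for word in vocabulary:
        key = word.lower()
        if key not in simple and key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Tiny made-up lists, for illustration only
simple_words = ["the", "king", "was", "and", "story"]
english_vocabulary = ["The", "regal", "king", "and", "the", "apocryphal", "story"]

print(find_difficult_words(english_vocabulary, simple_words))
# ['regal', 'apocryphal']
```

Converting the simple-word list to a set matters once the lists get large, because `word not in some_list` scans the whole list on every check.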

Answer 1: (score: 0)

You can also use pop() to remove a list of the most difficult words from an English dictionary.
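One possible reading of this suggestion, sketched under assumptions: the English dictionary is a Python dict mapping each word to a definition, and a list of known difficult words already exists. The names `english_dictionary`, `known_difficult`, and `difficult_entries`, as well as the sample data, are hypothetical and not from the answer.

```python
# Hypothetical miniature dictionary: word -> short definition (made-up sample data)
english_dictionary = {
    "the": "definite article",
    "regal": "fit for a monarch",
    "house": "a building for living in",
    "apocryphal": "of doubtful authenticity",
}

# Assumed to come from elsewhere, e.g. a hand-picked list
known_difficult = ["regal", "apocryphal", "sesquipedalian"]

difficult_entries = {}
for word in known_difficult:
    # dict.pop(key, default) removes the key and returns its value;
    # the default avoids a KeyError when the word is absent
    definition = english_dictionary.pop(word, None)
    if definition is not None:
        difficult_entries[word] = definition

print(sorted(difficult_entries))   # ['apocryphal', 'regal']
print(sorted(english_dictionary))  # ['house', 'the']
```

Note that this still needs a list of difficult words to begin with, so it moves the problem rather than solving it.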

Answer 2: (score: 0)

I define an "uncommon word" as a word that does not appear among the 10,000 most common English words.

The 10k most common words is an arbitrary cutoff, but as stated in the github repo:

  

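"According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications."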

import requests

english_most_common_10k = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'

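# Download the list of the 10k most common words from a TXT file in a GitHub repo
response = requests.get(english_most_common_10k, timeout=10)
response.raise_for_status()  # stop early if the download failed
data = response.text

# split() without arguments drops blank lines and stray whitespace
set_of_common_words = set(data.split())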

# Once we have the set of common words, we can just check membership.
# A set lookup is O(1) on average,
# but you could also use, for example, some kind of search tree with O(log(n)) lookups
while True:
    word = input().strip().lower()  # the word list is lowercase, so normalize the input
    if word in set_of_common_words:
        print(f'The word "{word}" is common')
    else:
        print(f'The word "{word}" is difficult')
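The loop above checks one word at a time. To answer the original question about splitting a whole text, the same set lookup can be applied to every token. The following is a minimal sketch, not part of the answer: the function name `uncommon_words`, the `min_length` parameter, the regular expression used for tokenizing, and the tiny sample set are all assumptions made for illustration. In practice, `common_words` would be the `set_of_common_words` built above.

```python
import re
from collections import Counter


def uncommon_words(text, common_words, min_length=3):
    """Count the words in `text` that are not in `common_words`."""
    # Lowercase the text and keep runs of letters, allowing an inner apostrophe (e.g. "don't")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(
        token
        for token in tokens
        if len(token) >= min_length and token not in common_words
    )


# Made-up sample data; a real run would pass the downloaded 10k word set instead
sample = "The regal queen told an apocryphal tale; the tale was regal."
common = {"the", "queen", "told", "an", "tale", "was"}

print(uncommon_words(sample, common).most_common())
# [('regal', 2), ('apocryphal', 1)]
```

Returning a `Counter` rather than a plain set keeps the frequencies, so the rarest or the most repeated uncommon words in a book can be listed first. Inflected forms that are missing from the common-word list (plurals, verb endings) would also be reported as uncommon by this sketch.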