Verifying correct use of "a" and "an" in English text - Python

Time: 2013-12-02 19:44:51

Tags: python grammar

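I want to create a program that reads text from a file and points out where "a" and "an" are used incorrectly. As far as I know, the general rule is that "an" is used when the next word starts with a vowel. The program should also take exceptions into account, and those should be read from a file as well.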

Can someone give me some tips and tricks on how to get started with this, such as functions that could help?

I would be very glad :-)

I am quite new to Python.

4 answers:

Answer 0: (score: 8)

Here is a solution where correctness is defined as: `an` comes before a word that starts with a vowel sound, otherwise `a` may be used.

#!/usr/bin/env python
import itertools
import re
import sys

try:
    from future_builtins import map, zip
except ImportError: # Python 3 (or old Python versions)
    map, zip = map, zip
from operator import methodcaller

import nltk  # $ pip install nltk
from nltk.corpus import cmudict  # >>> nltk.download('cmudict')

def starts_with_vowel_sound(word, pronunciations=cmudict.dict()):
    for syllables in pronunciations.get(word, []):
        return syllables[0][-1].isdigit()  # use only the first one

def check_a_an_usage(words):
    # iterate over words pairwise (recipe from itertools)
    # note: ignore Unicode case-folding (`.casefold()`)
    firsts, seconds = itertools.tee(map(methodcaller('lower'), words))
    next(seconds, None)  # advance the second iterator by one word
    for a, w in zip(firsts, seconds):
        # raw string so that \w is not treated as a string escape
        if (a == 'a' or a == 'an') and re.match(r'\w+$', w):
            valid = (a == 'an') if starts_with_vowel_sound(w) else (a == 'a')
            yield valid, a, w

#note: you could use nltk to split text in paragraphs,sentences, words
pairs = ((a, w)
         for sentence in sys.stdin.readlines() if sentence.strip() 
         for valid, a, w in check_a_an_usage(nltk.wordpunct_tokenize(sentence))
         if not valid)

print("Invalid indefinite article usage:")
print('\n'.join(map(" ".join, pairs)))

Example input (one sentence per line)

Validity is defined as `an` comes before a word that starts with a
vowel sound, otherwise `a` may be used.
Like "a house", but "an hour" or "a European" (from @Hyperboreus's comment http://stackoverflow.com/questions/20336524/gramatically-correct-an-english-text-python#comment30353583_20336524 ).
A AcRe, an AcRe, a rhYthM, an rhYthM, a yEarlY, an yEarlY (words from @tchrist's comment http://stackoverflow.com/questions/9505714/python-how-to-prepend-the-string-ub-to-every-pronounced-vowel-in-a-string#comment12037821_9505868 )
We have found a (obviously not optimal) solution." vs. "We have found an obvious solution (from @Hyperboreus answer)
Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word. (ditto)

Output

Invalid indefinite article usage:
a acre
an rhythm
an yearly

It is not obvious why the last pair is invalid; see Why is it "an yearly"?
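To make the vowel-sound test in `starts_with_vowel_sound` more concrete: CMUdict pronunciations are lists of ARPAbet phonemes in which vowel phonemes carry a trailing stress digit (0, 1 or 2), so checking whether the first phoneme ends in a digit tells you whether the word starts with a vowel sound. The sketch below illustrates this with a tiny hand-written table standing in for `cmudict.dict()`. The table `SAMPLE_PRONUNCIATIONS`, its entries, and the helper `expected_article` are assumptions for illustration only, not output from nltk.

```python
# Hand-written stand-in for cmudict.dict(); entries are illustrative assumptions.
SAMPLE_PRONUNCIATIONS = {
    "hour": [["AW1", "ER0"]],
    "house": [["HH", "AW1", "S"]],
    "european": [["Y", "UH2", "R", "AH0", "P", "IY1", "AH0", "N"]],
    "acre": [["EY1", "K", "ER0"]],
}


def starts_with_vowel_sound(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Return True/False from the first listed pronunciation, None if unknown."""
    for phonemes in pronunciations.get(word.lower(), []):
        # Vowel phonemes end in a stress digit, consonant phonemes do not.
        return phonemes[0][-1].isdigit()
    return None


def expected_article(word):
    """Return 'an' or 'a', or None when the word is not in the table."""
    vowel = starts_with_vowel_sound(word)
    if vowel is None:
        return None
    return "an" if vowel else "a"
```

Unlike the answer's code, this sketch returns `None` for unknown words instead of silently treating them as consonant-initial, so a caller can decide whether to skip them or fall back to a spelling rule.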

Answer 1: (score: 4)

Maybe this can give you a rough guideline:

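  1. You need to parse the input text into prosodic units, as I doubt that the "a/an" rule applies across prosodic boundaries (e.g. "We have found a (obviously not optimal) solution." vs. "We have found an obvious solution").

  2. Next you need to parse each prosodic unit into phonological words.

  3. Now you somehow need to identify the words that represent the indefinite article ("a house" vs. "grade A product").

  4. Once you have identified the articles, look at the next word in the prosodic unit and determine (here be dragons) the syllabic feature of the first phoneme of that word.

  5. If it has [+syll], the article should be "an". If it has [-syll], the article should be "a". If the article is at the end of the prosodic unit, it should probably be "a" (but what about ellipses: "Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word."). This is apart from the historical exceptions mentioned by abanert, dialectal variance, and so on.

  6. If the article found does not match the expected one, mark it as "incorrect".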


Here is some pseudocode:

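    # The five helpers are deliberately left unimplemented (here be dragons).
    def parseProsodicUnits(text): raise NotImplementedError
    def parsePhonologicalWords(unit): raise NotImplementedError
    def isUndefinedArticle(word): raise NotImplementedError
    def parsePhonemes(word): raise NotImplementedError
    def getFeatures(phoneme): raise NotImplementedError
    
    for unit in parseProsodicUnits(text):
        words = parsePhonologicalWords(unit)
        # skip the last word: an article there has no following word to check
        for idx, word in enumerate(words[:-1]):
            if not isUndefinedArticle(word): continue
            # look at the next *word* of the unit, not the next character
            syllabic = '+syll' in getFeatures(parsePhonemes(words[idx+1])[0])
            if (word == 'a' and syllabic) or (word == 'an' and not syllabic):
                print('incorrect')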
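
As a rough, runnable approximation of the steps above, the sketch below treats any run of punctuation as a stand-in for a prosodic boundary and the first letter of the next word as a stand-in for the [syll] feature. Both are assumptions that a real implementation would replace with proper prosodic and phonemic analysis, and the function name `find_suspect_articles` is hypothetical.

```python
import re


def find_suspect_articles(text):
    """Yield (article, next_word) pairs that look wrong within one unit."""
    # Assumption: any run of punctuation marks a prosodic boundary.
    for unit in re.split(r"[^\w\s']+", text):
        words = unit.split()
        # Pair each word with its successor inside the same unit only.
        for article, nxt in zip(words, words[1:]):
            if article.lower() not in ("a", "an"):
                continue
            # Assumption: spelling approximates the syllabic feature.
            syllabic = nxt[0].lower() in "aeiou"
            if (article.lower() == "an") != syllabic:
                yield article, nxt
```

Because pairs never cross a boundary, the "a" in "We have found a (obviously not optimal) solution." is not compared with "obviously", which matches the point made in step 1.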
    

Answer 2: (score: 1)

all_words = "this is an wonderful life".split()
# stop one short of the end so that all_words[i+1] always exists
for i in range(len(all_words) - 1):
    if all_words[i].lower() in ["a", "an"]:
        if all_words[i+1][0].lower() in "aeiou":
            all_words[i] = all_words[i][0] + "n"
        else:
            all_words[i] = all_words[i][0]
print(" ".join(all_words))

This should get you started, but it is not a complete solution...
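
One way to package the loop above as a reusable function is sketched below. It keeps the same first-letter heuristic, so it still gets "an hour" and "a European" wrong, and it leaves a trailing article untouched. The name `fix_articles` is a hypothetical helper, not part of the answer.

```python
def fix_articles(sentence):
    """Rewrite a/an according to the first letter of the following word."""
    words = sentence.split()
    # Stop one short of the end so that words[i + 1] always exists.
    for i in range(len(words) - 1):
        if words[i].lower() in ("a", "an"):
            # Keep the original first letter to preserve capitalisation.
            first = words[i][0]
            if words[i + 1][0].lower() in "aeiou":
                words[i] = first + "n"
            else:
                words[i] = first
    return " ".join(words)
```

Note that splitting and re-joining collapses runs of whitespace, which may matter if the original layout of the text has to be preserved.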

Answer 3: (score: 1)

I would probably start with an approach like the following:

exceptions = set()  # fill in with a whole bunch of exceptions
article = None
# `text` is assumed to hold the input string
for word in text.split():
    if article:
        # first guess: go by the first letter of the word after the article
        vowel = word[0].lower() in "aeiou"
        if word.lower() in exceptions:
            vowel = not vowel
        if (article.lower() == "an" and not vowel) or (article.lower() == "a" and vowel):
            print("Misused article '%s %s'" % (article, word))
        article = None
    if word.lower() in ('a', 'an'):
        article = word
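
Since the question asks for the exceptions to be read from a file, one possible way to fill `exceptions` is sketched below. The file format (one word per line, `#` starting a comment) and the names `load_exceptions` and `misused_articles` are assumptions for illustration. An in-memory `io.StringIO` stands in for a real file here.

```python
import io


def load_exceptions(lines):
    """Build a set of lowercase exception words from an iterable of lines."""
    words = set()
    for line in lines:
        # Drop anything after '#', then surrounding whitespace.
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def misused_articles(text, exceptions):
    """Return (article, word) pairs that break the first-letter rule."""
    found = []
    article = None
    for word in text.split():
        if article:
            vowel = word[0].lower() in "aeiou"
            # An exception word flips the first-letter guess.
            if word.lower() in exceptions:
                vowel = not vowel
            if (article.lower() == "an") != vowel:
                found.append((article, word))
            article = None
        if word.lower() in ("a", "an"):
            article = word
    return found


# With a real file, pass the open file object instead of the StringIO.
sample = io.StringIO("hour\neuropean  # starts with a 'y' sound\n")
exceptions = load_exceptions(sample)
print(misused_articles("an hour in a European city, a hour later", exceptions))
```

Words are compared as split on whitespace, so a word with attached punctuation such as "hour," would not match its exception entry. Stripping punctuation first is left out to keep the sketch close to the answer.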