I'm new to Python, and this is my first attempt at applying what I've learned, but I know I'm being inefficient. The code works, but it takes several minutes to finish executing on the text file of a novel.
Is there a more efficient way to achieve the same output? Any style critiques would also be appreciated. Thanks!
def realWords(inFile, dictionary, outFile):
    with open(inFile, 'r') as inf, open(dictionary, 'r') as dictionary, open(outFile, 'w') as outf:
        realWords = ''
        dList = []
        for line in dictionary:
            dSplit = line.split()
            for word in dSplit:
                dList.append(word)
        for line in inf:
            wordSplit = line.split()
            for word in wordSplit:
                if word in dList:
                    realWords += word + ' '
        outf.write(realWords)
        print('File of real words created')
    inf.close()
    dictionary.close()
    outf.close()
'''
I created a function to compare the words in a text file to real words taken
from a reference dictionary (like the Webster Unabridged Dictionary). It
takes a text file and breaks it up into individual word components. It then
compares each word to each word in the reference dictionary text file in
order to test whether the word is a real word or not. This is done so as to
eliminate non-real words, names, and some other junk. Each word that passes
the test is appended to a single string. Once all words have been parsed,
the output string containing all real words is written to a new text file.
'''
Answer 0 (score: 1)
For every word in the novel, you search through the entire dictionary to see whether you can find that word. That is really slow.
You can benefit from the set() data structure, which lets you determine in constant time whether an element is in it.
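To see why, here is a small, self-contained timeit sketch (the word count is made up) comparing a membership test against a list with one against a set:

import timeit

# Hypothetical scale: a 100,000-word dictionary.
word_list = ['word%d' % i for i in range(100000)]
word_set = set(word_list)

# List membership scans elements one by one: O(n) per lookup.
print(timeit.timeit(lambda: 'word99999' in word_list, number=100))
# Set membership is a hash lookup: O(1) on average.
print(timeit.timeit(lambda: 'word99999' in word_set, number=100))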
In addition, you can speed the code up even more by removing the string concatenation and using .join() instead.
I made some tweaks to your code so it uses set() and .join(), which should speed it up considerably:
def realWords(inFile, dictionary, outFile):
    with open(inFile, 'r') as inf, open(dictionary, 'r') as dictionary, open(outFile, 'w') as outf:
        realWords = []  # note: a list, for constant-time appends
        dList = set()
        for line in dictionary:
            dSplit = line.split()
            for word in dSplit:
                dList.add(word)
        for line in inf:
            wordSplit = line.split()
            for word in wordSplit:
                if word in dList:  # done in constant time because dList is a set
                    realWords.append(word)
        outf.write(' '.join(realWords))
        print('File of real words created')
    inf.close()
    dictionary.close()
    outf.close()
答案 1 :(得分:1)
You can use set() for fast word lookups, and " ".join(your_list) to speed up the string concatenation, for example:
def write_real_words(in_file, dictionary, out_file):
    with open(in_file, 'r') as i, open(dictionary, 'r') as d, open(out_file, 'w') as o:
        dictionary_words = set()
        for l in d:
            dictionary_words |= set(l.split())
        real_words = [word for l in i for word in l.split() if word in dictionary_words]
        o.write(" ".join(real_words))
        print('File of real words created')
As for style, the above is mostly PEP 8 compliant. I shortened the variable names to avoid horizontal scrolling in the code block on SO; I'd suggest you use something more descriptive in your actual code.
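A hypothetical call (the file names here are just placeholders) would look like:

write_real_words('Don_Quixote.txt', 'The_Webster_Unabridged_Dictionary.txt', 'Verified_words.txt')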
Answer 2 (score: 1)
I wrote one possible response. My main comments are:
1) Modularize the functionality more; that is, each function should do fewer things (and do them well). The function realWords can only be reused in the very specific case where you want to do exactly what you propose. The functions below each do less, so they are more likely to be reused.
2) I added functionality to strip special characters from words, to avoid type II errors (i.e., to avoid missing a real word and calling it nonsense).
3) I added functionality to store everything that was designated as not a real word. The main QC step for this workflow is to iteratively inspect what falls into the "nonsense" category and systematically eliminate real words that were missed.
4) The real words from the dictionary are stored as a set, to guarantee the shortest lookup time.
5) I haven't run this, since I don't have the appropriate input files, so I may have some typos or bugs.
# real words could be missed if they adjoin a special character,
# so strip all incoming words of special chars
def clean_words_in_line(input_line):
    """Iterate through a line, remove special characters, return clean words."""
    chars_to_strip = ":;,."  # add characters as need be to remove them
    clean_words = []
    for dirty_word in input_line:
        clean_words.append(dirty_word.strip(chars_to_strip))
    return clean_words


def ref_words_to_set(dct_file):
    """Iterate through a source file of known words, build a list of real words, return it as a set."""
    clean_word_list = []
    with open(dct_file, 'r') as dt_fh:
        for line in dt_fh:
            line = line.strip().split()
            clean_line = clean_words_in_line(line)
            for word in clean_line:
                clean_word_list.append(word)
    clean_word_set = set(clean_word_list)  # convert to a set to minimize lookup time
    return clean_word_set


def find_real_words(my_novel, cws):
    """Iterate through a book or novel, check for clean words."""
    words_in_dict = []
    quite_possibly_runcible = []
    with open(my_novel) as mn_fh:
        for line in mn_fh:
            line = line.strip().split()
            clean_line = clean_words_in_line(line)
            for word in clean_line:
                if word in cws:
                    words_in_dict.append(word)
                else:
                    quite_possibly_runcible.append(word)
    return (words_in_dict, quite_possibly_runcible)


set_of_real_words = ref_words_to_set("The_Webster_Unabridged_Dictionary.txt")
(real_words, non_sense) = find_real_words("Don_Quixote.txt", set_of_real_words)

with open("Verified_words.txt", 'a') as outF:
    outF.write(" ".join(real_words) + "\n")

with open("Lears_words.txt", 'a') as n_outF:
    n_outF.write(" ".join(non_sense) + "\n")
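One subtlety worth checking (a minimal sketch, not part of the answer's code): str.strip only removes the listed characters from the ends of a word, so internal punctuation survives:

print("end.".strip(":;,."))      # -> end
print("don't".strip(":;,."))     # -> don't  (internal apostrophe is kept)
print(",quoted,".strip(":;,."))  # -> quoted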
Answer 3 (score: 1)
This answer is aimed at understanding, not just at handing you better code.
What you need to do is learn Big O notation.
The complexity of reading the dictionary is O(number of lines in dictionary * number of words per line), or simply O(number of words in dictionary).
At first glance, reading inf has a similar complexity. However, idiomatic Python is deceptive here: if word in dList is not a constant-time operation for some types. Additionally, += requires building a new object in the Python language (although in limited cases it can optimize this away; don't rely on that), so its complexity is equivalent to O(length of realWords). Assuming most words are actually in the dictionary, that is comparable to the length of the file.
So the overall complexity of this step is O(number of words in infile * number of words in dictionary) with the optimization, or O((number of words in infile)² * number of words in dictionary) without it.
Since the complexity of the first step is smaller in every component, the overall complexity is simply that of the second part.
The other answers give you a complexity of O(number of words in dictionary + number of words in file), which is irreducible, because the two sides of the + are unrelated. Of course, this assumes no hash collisions, but as long as your dictionary is not user input, that is a safe assumption. (If it is, grab the blist package from PyPI for a handy container with good worst-case behavior.)
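To make the += cost concrete, here is a minimal, self-contained timeit sketch (the word count is hypothetical) comparing repeated concatenation with a single join; note that CPython can sometimes optimize the += case in place, so the measured gap varies:

import timeit

words = ['word'] * 20000  # hypothetical: a 20,000-word file

def with_concat():
    # each += may copy the whole string built so far: up to O(n**2) work in total
    out = ''
    for w in words:
        out += w + ' '
    return out

def with_join():
    # join allocates the result once: O(n) work in total
    return ' '.join(words) + ' '

print(timeit.timeit(with_concat, number=10))
print(timeit.timeit(with_join, number=10))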