Question

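I'm new to programming. I'm running this script to clean a large text file (more than 12,000 lines) and write the result to another .txt file. With a smaller file (about 500 lines) it runs quickly, so my conclusion is that the slowdown comes from the file size. I'd appreciate it if someone could guide me on making this code more efficient.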

input_file = open('bNEG.txt', 'rt', encoding='utf-8')
l_p = LanguageProcessing()
sentences=[]
for lines in input_file.readlines():
    tokeniz = l_p.tokeniz(lines)
    cleaned_url = l_p.clean_URL(tokeniz)
    remove_words = l_p.remove_non_englishwords(cleaned_url)
    stopwords_removed = l_p.remove_stopwords(remove_words)
    cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"
    output_file = open('cNEG.txt', 'w', encoding='utf-8')
    sentences.append(cleaned_sentence)
    output_file.writelines(sentences)
input_file.close()
output_file.close()

Edit: Below is the corrected code based on the answer, along with a few other changes to fit my requirements.

input_file = open('chromehistory_log.txt', 'rt', encoding='utf-8')
# Open the output file once, before the loop
output_file = open('dNEG.txt', 'w', encoding='utf-8')
l_p = LanguageProcessing()
#sentences=[]
for lines in input_file.readlines():
    #print(lines)
    tokeniz = l_p.tokeniz(lines)
    cleaned_url = l_p.clean_URL(tokeniz)
    remove_words = l_p.remove_non_englishwords(cleaned_url)
    stopwords_removed = l_p.remove_stopwords(remove_words)
    #print(stopwords_removed)
    if stopwords_removed==[]:
        # Nothing left after cleaning: skip this line
        continue
    else:
        cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"

    #sentences.append(cleaned_sentence)
    # write() is the right call for a single string; writelines() expects an iterable of strings
    output_file.write(cleaned_sentence)
input_file.close()
output_file.close()

Answer 1

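Posting the discussion as an answer:

There are two issues here:

You open/create the output file and write the data inside the loop, once for every line of the input file. In addition, you collect all the data in a list (sentences), so each pass rewrites everything collected so far.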
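To get a feel for why this scales badly, here is a rough count of how many lines the original loop ends up writing in total. It is only an illustrative sketch: `total_lines_written` is a hypothetical helper and not part of the question's code.

```python
def total_lines_written(n_input_lines: int) -> int:
    """Count lines written when the whole growing list is rewritten on every iteration."""
    total = 0
    for i in range(1, n_input_lines + 1):
        # On iteration i the list holds i sentences, and all of them are written again.
        total += i
    return total


print(total_lines_written(500))    # 125250
print(total_lines_written(12000))  # 72006000
```

The work grows roughly with the square of the input size, which would explain why 500 lines feel fast while 12,000 lines do not.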

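You have two options:

a) Create the file before the loop and write only "cleaned_sentence" inside the loop (and drop the collecting into "sentences").

b) Collect everything in "sentences" and write "sentences" in one go after the loop.

The downside of a) is that it is slightly slower than b) (as long as the OS doesn't have to swap memory for b). The upside is that it uses far less memory and works no matter how large the file is or how little memory the computer has.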
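The two options could look roughly like the sketch below. Since `LanguageProcessing` is not shown in the question, `clean_line` is a hypothetical stand-in for that pipeline, and the function names `clean_streaming` and `clean_collect` are made up for illustration.

```python
from typing import Callable, List


def clean_line(line: str) -> List[str]:
    """Hypothetical stand-in for the cleaning pipeline: keep lowercase alphabetic tokens."""
    return [tok.lower() for tok in line.split() if tok.isalpha()]


def clean_streaming(src: str, dst: str,
                    clean: Callable[[str], List[str]] = clean_line) -> None:
    """Option a): open the output once and write each cleaned line as it is produced."""
    with open(src, "rt", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:  # reads lazily, one line at a time
            tokens = clean(line)
            if tokens:  # skip lines that are empty after cleaning
                fout.write(" ".join(tokens) + "\n")


def clean_collect(src: str, dst: str,
                  clean: Callable[[str], List[str]] = clean_line) -> None:
    """Option b): collect all cleaned lines in memory, then write them once after the loop."""
    sentences = []
    with open(src, "rt", encoding="utf-8") as fin:
        for line in fin:
            tokens = clean(line)
            if tokens:
                sentences.append(" ".join(tokens) + "\n")
    with open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(sentences)
```

Both variants should produce the same output file; they differ only in how much is held in memory at once. The `with` blocks also close the files automatically, even if the cleaning step raises an error.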

Making the write-to-file process more efficient
