Iterating over multiple text files and comparing them

Time: 2016-09-19 23:57:09

Tags: python list file iteration

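I'm trying to write a function that puts text files into a list and then iterates over the files to find exact and partial copies, so I can weed out people who may have cheated by plagiarizing their work. I start from my class roster and append .txt to each name to find that student's assignment and to check whether they turned one in at all. I have over 500 students' papers to read. The code I've written so far iterates over the contents of each .txt file literally word for word, so I get far too many "Cheated" messages. Please help.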

def Cheaters():
    file = open("roster.txt", "r")
    L = []
    for i in file:
        new = [i[:-1], ".txt"]
        new2 = "".join(new)
        if i not in L:
            L.append(new2)
    for j in L:
        try:
            file2 = open(j, "r")
            for n in file2:
                for m in file2:
                    if n == m:
                        print("Cheated")
        except:
            print("No work submitted")

1 Answer:

Answer 0 (score: 0)

Try this. You may need to modify it depending on your file structure, but it should be close.

import re
from itertools import product

def hash_sentences(document):
    # replace every character except letters, digits, basic punctuation
    # and spaces with a space
    cleaned_text = re.sub(r'[^A-Za-z0-9,;:.?! ]', ' ', document)
    # split into a list of sentences on ., ? and !
    sentences = re.split(r'[.?!]', cleaned_text)

    # the length check (more than 5 characters) drops short fragments like "Dr."
    # return a hash of each normalized sentence for comparison
    return [hash(s.strip().lower()) for s in sentences if len(s.strip()) > 5]

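def compare_documents(doc1, doc2):
    hash1 = hash_sentences(doc1)
    hash2 = set(hash_sentences(doc2))  # a set gives fast membership tests
    if not hash1:
        # doc1 has no usable sentences, so nothing can match
        return 0.0
    # return the fraction (0.0 to 1.0) of sentences of doc1 that are in doc2
    return sum((h in hash2) for h in hash1) / float(len(hash1))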

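# get list of document file names, one roster name per line
# strip the trailing newline so the name does not become 'name\n.txt'
with open('roster.txt', 'r') as fp:
    doc_fnames = [d.strip() + '.txt' for d in fp if d.strip()]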

# create dictionary of file names and content
doc_dict = {}
for fname in doc_fnames:
    try:
        with open(fname, 'r') as fp:
            doc_dict[fname] = fp.read()
    except IOError:
        # the file is missing or unreadable
        print('No submission: %s' % fname)

# iterate through the ordered pairs of documents, skipping self-comparisons
# (a document always matches itself 100%)
for name1, name2 in product(doc_dict, repeat=2):
    if name1 == name2:
        continue
    pct = compare_documents(doc_dict[name1], doc_dict[name2])
    print('Percentage of %s sentences in %s: %0.2f%%' % (name1, name2, 100 * pct))
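
If you want to sanity-check the idea before pointing it at 500 files, here is a minimal sketch that runs the same kind of sentence-overlap comparison on two in-memory strings. The names `sentence_set`, `overlap_fraction`, `essay_a` and `essay_b` are made up for illustration, and this variant compares sets of distinct normalized sentences rather than lists of hashes, so repeated sentences are counted once. Treat it as a rough illustration under those assumptions, not a drop-in replacement.

```python
import re

def sentence_set(text):
    # Split on ., ? and !, then strip whitespace and lowercase each piece.
    # Fragments of 5 characters or fewer are dropped as noise.
    parts = re.split(r'[.?!]', text)
    return {p.strip().lower() for p in parts if len(p.strip()) > 5}

def overlap_fraction(doc1, doc2):
    # Fraction of doc1's distinct sentences that also appear in doc2.
    s1, s2 = sentence_set(doc1), sentence_set(doc2)
    if not s1:
        return 0.0
    return len(s1 & s2) / float(len(s1))

# Hypothetical sample essays, used only for this demonstration.
essay_a = "The cat sat on the mat. Dogs are loyal animals. I like tea!"
essay_b = "Dogs are loyal animals. The weather was cold that day."

print(overlap_fraction(essay_a, essay_b))  # share of essay_a found in essay_b
print(overlap_fraction(essay_b, essay_a))  # share of essay_b found in essay_a
```

Note that the measure is not symmetric: one of essay_a's three sentences appears in essay_b, while one of essay_b's two sentences appears in essay_a. That is why each pair of documents is worth comparing in both directions.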