Evaluating and removing duplicate dicts from a Python list

Time: 2014-07-22 13:31:19

Tags: python

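Business problem: I have a list of dicts representing a student's academic history: the courses they took, when they took them, and the grade received (a blank grade means the course is in progress). I need to find any repeated attempts at a given course and keep only the attempt with the highest grade.

What I have tried so far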

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'},
{'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
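  1. First, I tried iterating over the acad_hist list and adding any course I had not yet seen to a "seen" list. The plan was that when I hit a course already in the "seen" list, I would go back to acad_hist, fetch that course's details (e.g. its grade), compare the grades, and remove the lower-graded attempt from acad_hist. The problem is that I cannot easily go back and "grab" the previously seen course from the "seen" list, and once I know which one has to go, it is even harder to point at the right entry in acad_hist to remove it. The code is a mess, but this is what I have so far: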

    key = 'crse_id'
    seen = []
    for index, course in enumerate(acad_hist[:]):
        if course[key] not in seen:
            seen.append(course[key])
        else:
            logger.info('found duplicate {0} at index {1}'.format(course[key], index))
            # < not sure what to do here… >
    

    Output

    found duplicate GRG 302P0 at index 11
    
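  2. Then I thought I could use set() to de-duplicate the list for me, but I need to choose which instance of a course to keep, and set() does not seem to give me a way to do that.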

    names = set(d['compressed_hist_crse_id'] for d in acad_hist_condensed)
    logger.info('TEST names: {0}'.format(names))
    

    Output

    TEST names: set([u'GRG 302P0', u'URB 3010'])
    
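  3. Wondering whether I could build on #2 above, I tried a "belt and suspenders" loop over the "names" output of set() to collect a grade for each course. It runs, but I will not pretend to fully understand what it is doing, and it does not really let me do the processing I need.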

    new_dicts = []
    for name in names:
        d = dict(name=name)
        d['grade'] = max(d['grade'] for d in acad_hist if d['crse_id'] == name)
        new_dicts.append(d)
    logger.info('TEST new_dicts: {0}'.format(new_dicts))
    

    Output

    TEST new_dicts: [{'grade': u'', 'name': u'GRG 302P0'}, {'grade': u'B+', 'name': u'URB 3010'}]
    
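  4. Can anyone supply the missing pieces, or even better, suggest a better approach?

    Update - the solution I ended up with (adapted from ideas in the answer I accepted)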

    import logging


    def scrub_for_duplicate_courses(acad_hist_condensed, acad_hist_list):
        """
        Looks for duplicate courses that may have been taken, and if any are found, will look for the one with the
        highest grade and keep that one, deleting the other course from the lists before returning them.
        """

        # -------------------------------------------
        # set logging params
        # -------------------------------------------
        logger = logging.getLogger(__name__)

        # -----------------------------------------------------------------------------------------------------
        # the grade_list is in order of ascending priority/value...a blank grade indicates "in-progress", and
        # will therefore replace any class instance that has a grade.
        # -----------------------------------------------------------------------------------------------------
        grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+', '']
        # converting the grade_list into a more efficient, weighted dict
        grade_list = dict(zip(grade_list, range(len(grade_list))))

        seen_courses = {}

        for course in acad_hist_condensed[:]:
            # -------------------------------------------------------------------------------------------------
            # one of the two keys checked for below should exist in each record, but not both
            # -------------------------------------------------------------------------------------------------
            key = ''
            if 'compressed_hist_crse_id' in course:
                key = 'compressed_hist_crse_id'
            elif 'compressed_ovrd_crse_id' in course:
                key = 'compressed_ovrd_crse_id'

            cid = course[key]
            grade = course['grade']

            if cid not in seen_courses:
                seen_courses[cid] = grade
            else:
                # ---------------------------------------------------------------------------------------------
                # if we get here, a duplicate course_id has been found in the acad_hist_condensed list, so now
                # we'll want to determine which one has the lowest grade, and remove that course instance from
                # both lists.
                # ---------------------------------------------------------------------------------------------
                if grade_list.get(seen_courses[cid], 0) < grade_list.get(grade, 0):
                    # the new attempt wins: note the old grade for removal BEFORE overwriting it
                    grade_for_rec_to_remove = seen_courses[cid]
                    crse_id_for_rec_to_remove = cid
                    seen_courses[cid] = grade
                else:
                    grade_for_rec_to_remove = grade
                    crse_id_for_rec_to_remove = cid

                # ---------------------------------------------------------------------------------------------
                # find the rec in acad_hist_condensed that needs removal
                # ---------------------------------------------------------------------------------------------
                for rec in acad_hist_condensed:
                    if rec.get(key) == crse_id_for_rec_to_remove and rec['grade'] == grade_for_rec_to_remove:
                        acad_hist_condensed.remove(rec)
                        break  # remove one record only; never keep iterating a list after mutating it
                for rec in acad_hist_list:
                    if rec == crse_id_for_rec_to_remove:
                        acad_hist_list.remove(rec)
                        break  # just want to remove one occurrence

        return acad_hist_condensed, acad_hist_list
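
A note on attempt 3 above: the built-in max() compares grade strings lexicographically, so it does not follow the order of grade_list. The following is only a minimal sketch of ranking by position instead, assuming Python 3; RANK and best_grade are illustrative names that do not appear in the original code, and ignoring blank grades is a choice made for this example only.

```python
grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
# Map each grade to its position so that a higher index means a better grade.
RANK = {grade: position for position, grade in enumerate(grade_list)}


def best_grade(grades):
    """Return the best grade by grade_list order, ignoring blank or unknown grades."""
    ranked = [g for g in grades if g in RANK]
    # Fall back to a blank grade when nothing rankable was supplied.
    return max(ranked, key=RANK.get) if ranked else ''


attempts = ['B+', 'A-', 'D']
print(max(attempts))         # plain string comparison
print(best_grade(attempts))  # comparison by position in grade_list
```

Passing a key function to max() keeps the original grade string as the result while comparing by rank.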
    

2 answers:

Answer 0: (score: 1)

A simple solution is to iterate over each student's course history and compute the highest grade for each course...

acad_hist = [{'crse_id': u'GRG 302P0', 'grade': u''}, {'crse_id': u'URB 3010', 'grade': u'B+'}, {'crse_id': u'GRG 302P0', 'grade': u'D'}]

grade_list = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
#let's turn grade_list into something more efficient:
grade_list = dict(zip(grade_list, range(len(grade_list)))) # 'CR' == 0, 'D-' == 1

courses = {} # keys will be crse_id, values will be grade.
for course in acad_hist:
    cid = course['crse_id']
    g = course['grade']
    if cid not in courses:
        courses[cid] = g 
    else:
        if grade_list.get(courses[cid], 0) < grade_list.get(g,0):
            courses[cid] = g 

The output is:

{u'GRG 302P0': u'D', u'URB 3010': u'B+'}

If needed, this can be converted back into the original form (a list of dicts).
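
A minimal sketch of that conversion, assuming a course-to-grade dict like the one built above. The function name to_records and its parameters are hypothetical, and the result is sorted by course id only to make the order predictable.

```python
def to_records(courses, id_key='crse_id', grade_key='grade'):
    """Turn a {course_id: grade} mapping back into a list of dicts."""
    return [{id_key: cid, grade_key: grade}
            for cid, grade in sorted(courses.items())]


courses = {'GRG 302P0': 'D', 'URB 3010': 'B+'}
records = to_records(courses)
print(records)
```

Note that only the course id and grade survive this round trip. If the original records carry other fields, one option is to store the whole record in the dict instead of just the grade.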

Answer 1: (score: 1)

This can be done with iterator Lego (i.e. ifilter, sorted, groupby, max):

def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

The boring full code:

from itertools import groupby, ifilter


COURSE_ID = 'crse_id'
GRADE = 'grade'

ACADEMIC_HISTORY = [
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'B',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : '',
    },
    {
        COURSE_ID: 'URB 3010',
        GRADE    : 'B+',
    },
    {
        COURSE_ID: 'GRG 302P0',
        GRADE    : 'D',
    },
]

GRADES = [
    'CR',
    'D-',
    'D' ,
    'D+',
    'C-',
    'C' ,
    'C+',
    'B-',
    'B' ,
    'B+',
    'A-',
    'A' ,
    'A+',
]

GRADES = dict(zip(GRADES, range(len(GRADES))))


def find_best_grades(history):
    def course(course_grade):
        return course_grade['crse_id']
    def grade(course_grade):
        return GRADES[course_grade['grade']]
    def has_grade(course_grade):
        return bool(course_grade['grade'])

    # 1) Remove course grades without grades.
    # 2) Sort the history so that grades for the same course are
    #    consecutive (this allows groupby to work).
    # 3) Group grades for the same course together.
    # 4) Use max to select the highest grade obtained for a course.

    return [max(course_grades, key=grade)
            for _, course_grades in
            groupby(sorted(ifilter(has_grade, history), key=course),
                    key=course)]

best_grades = find_best_grades(ACADEMIC_HISTORY)
print best_grades
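
The code above is written for Python 2: itertools.ifilter no longer exists in Python 3, and print is a function there. The following is a sketch of the same filter, sort, group, max pipeline for Python 3, not a definitive port; the variable names graded, ordered, history and best are chosen here for illustration.

```python
from itertools import groupby

GRADE_ORDER = ['CR', 'D-', 'D', 'D+', 'C-', 'C', 'C+',
               'B-', 'B', 'B+', 'A-', 'A', 'A+']
GRADES = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}


def find_best_grades(history):
    def course(record):
        return record['crse_id']

    def grade(record):
        return GRADES[record['grade']]

    # 1) Drop records with a blank grade (in-progress attempts).
    graded = (record for record in history if record['grade'])
    # 2) Sort so that attempts at the same course are adjacent,
    #    which groupby requires.
    ordered = sorted(graded, key=course)
    # 3) Group by course and 4) keep the highest-ranked attempt in each group.
    return [max(attempts, key=grade)
            for _, attempts in groupby(ordered, key=course)]


history = [
    {'crse_id': 'GRG 302P0', 'grade': 'B'},
    {'crse_id': 'GRG 302P0', 'grade': ''},
    {'crse_id': 'URB 3010', 'grade': 'B+'},
    {'crse_id': 'GRG 302P0', 'grade': 'D'},
]
best = find_best_grades(history)
print(best)
```

As in the original answer, a course whose only attempt has a blank grade is dropped from the result entirely, which may or may not be what the question needs.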