Question

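I am trying to pick out common tag combinations from a list of educational question records.

For this example I am only looking at 2-tag combinations (tag-tag), and I should get results such as: "point" + "curve" (65 entries), "add" + "subtract" (40 entries), ...

This is the desired result expressed as an SQL statement: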

SELECT a.tag, b.tag, count(*)
FROM examquestions.dbmanagement_tag as a
INNER JOIN examquestions.dbmanagement_tag as b on a.question_id_id = b.question_id_id
where a.tag != b.tag
group by a.tag, b.tag

Basically, we identify the different tags that share common questions and group them into the same matching tag combinations.
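To make the self-join concrete, here is a minimal sketch that runs the same query against an in-memory SQLite database. The table is a hypothetical two-column stand-in for `examquestions.dbmanagement_tag`, and the sample rows are invented for illustration. It also swaps `!=` for `<`, which is an assumption about what is wanted: with `!=` every pair is reported twice, once in each order.

```python
import sqlite3

# Hypothetical stand-in for examquestions.dbmanagement_tag, reduced to two columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dbmanagement_tag (question_id_id INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO dbmanagement_tag VALUES (?, ?)",
    [
        (1, "point"), (1, "curve"),
        (2, "point"), (2, "curve"), (2, "add"),
        (3, "add"), (3, "subtract"),
    ],
)

# a.tag < b.tag keeps one row per unordered pair; with != each pair
# would appear twice, once as (a, b) and once as (b, a).
pair_counts = conn.execute(
    """
    SELECT a.tag, b.tag, COUNT(*)
    FROM dbmanagement_tag AS a
    INNER JOIN dbmanagement_tag AS b ON a.question_id_id = b.question_id_id
    WHERE a.tag < b.tag
    GROUP BY a.tag, b.tag
    ORDER BY COUNT(*) DESC, a.tag, b.tag
    """
).fetchall()

for first, second, n in pair_counts:
    print(first, "+", second, ":", n)
```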

I tried to perform a similar query using a Django queryset:

    twotaglist = [] #final set of results

    alphatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
    betatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
    startindex = 0 #startindex is increased by 1 each time the atag changes, shortening the betatag range. this avoids comparing the same pair of tags twice
    for atag in alphatags:
        for btag in betatags[startindex:]:
            if (atag['tag'] != btag['tag']):
                commonQns = [] #to check how many common qns
                atagQns = tag.objects.filter(tag=atag['tag'], question_id__in=qnlist).values('question_id').annotate()
                btagQns = tag.objects.filter(tag=btag['tag'], question_id__in=qnlist).values('question_id').annotate()
                for atagQ in atagQns:
                    for btagQ in btagQns:
                        if (atagQ['question_id'] == btagQ['question_id']):
                            commonQns.append(atagQ['question_id'])
                if (len(commonQns) > 0):
                    twotaglist.append({'atag': atag['tag'],
                                        'btag': btag['tag'],
                                        'count': len(commonQns)})
        startindex=startindex+1

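The logic works fine, but since I am quite new to this platform, I am not sure whether there is a shorter approach that would also improve the efficiency.

Currently, the query takes about 45 seconds for a comparison of roughly 5K x 5K tags :(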
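Most of that time is probably spent on the two queries issued inside the inner loop for every tag pair. As a rough sketch of one way around that, the rows could be fetched once and compared in memory with set intersections. The function name `pair_counts_from_rows` and the sample rows are hypothetical; in the real project the rows might come from something like `tag.objects.filter(question_id__in=qnlist).values_list('question_id', 'tag')`.

```python
from collections import defaultdict
from itertools import combinations


def pair_counts_from_rows(rows):
    """Count shared questions for every tag pair from (question_id, tag) rows."""
    questions_by_tag = defaultdict(set)
    for question_id, tag_name in rows:
        questions_by_tag[tag_name].add(question_id)

    results = []
    # combinations() over the sorted names visits each unordered pair once,
    # which is what the startindex bookkeeping does by hand.
    for a_name, b_name in combinations(sorted(questions_by_tag), 2):
        shared = questions_by_tag[a_name] & questions_by_tag[b_name]
        if shared:
            results.append({"atag": a_name, "btag": b_name, "count": len(shared)})
    return results


# Hypothetical sample rows of (question_id, tag).
sample_rows = [(1, "point"), (1, "curve"), (2, "point"), (2, "curve"), (2, "add")]
print(pair_counts_from_rows(sample_rows))
```

This still visits every pair of tags, so it scales with the square of the number of distinct tags, but no database query runs inside the loop.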

Addendum: the tag class

class tag(models.Model):
    id = models.IntegerField('id',primary_key=True,null=False)
    question_id = models.ForeignKey(question,null=False)
    tag = models.TextField('tag',null=True)
    type = models.CharField('type',max_length=1)

    def __str__(self):
        return str(self.tag)

Answer 1

If I understood your question correctly, I would keep things simpler and do something like this:

# tag is the model class from the question; evaluate the queryset once
relevant_tags = list(tag.objects.filter(question_id__in=qnlist))
# Here relevant_tags has both a and b tags

unique_tags = set()
for tag_item in relevant_tags:
    unique_tags.add(tag_item.tag)

# unique_tags should have your A and B tags

a_tag = unique_tags.pop()
b_tag = unique_tags.pop()

# Some logic to make sure what is A and what is B

# list comprehensions instead of filter(), so the results can be reused and concatenated
a_tags = [t for t in relevant_tags if t.tag == a_tag]
b_tags = [t for t in relevant_tags if t.tag == b_tag]

# a_tags and b_tags contain A and B tags filtered from relevant_tags

same_question_tags = dict()

for q in qnlist:
    # question_id is the ForeignKey, so question_id_id holds the raw key value
    a_list = [t for t in a_tags if t.question_id_id == q.id]
    b_list = [t for t in b_tags if t.question_id_id == q.id]
    same_question_tags[q] = a_list + b_list

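The nice thing about this is that you can extend it to N tags: iterate over the returned tags in a loop to collect all the unique tags, then iterate further to filter by each tag.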
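As a hedged illustration of the N-tag idea, the sketch below groups tags by question and counts every combination of a given size. It swaps the per-tag filtering above for `itertools.combinations`, so it is a variation rather than the exact method described. The function name `count_tag_combinations` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_tag_combinations(rows, size):
    """Count how many questions carry each combination of `size` tags."""
    tags_by_question = defaultdict(set)
    for question_id, tag_name in rows:
        tags_by_question[question_id].add(tag_name)

    counts = Counter()
    for tag_names in tags_by_question.values():
        # Sorting makes the tuple for a given set of tags always look the same.
        counts.update(combinations(sorted(tag_names), size))
    return counts


# Hypothetical sample rows of (question_id, tag).
sample_rows = [
    (1, "point"), (1, "curve"), (1, "add"),
    (2, "point"), (2, "curve"), (2, "add"),
    (3, "point"), (3, "curve"),
]
pairs = count_tag_combinations(sample_rows, 2)
triples = count_tag_combinations(sample_rows, 3)
print(pairs.most_common())
print(triples.most_common())
```

Questions with fewer than `size` tags contribute nothing, because `combinations` yields no tuples for them.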

There are certainly more ways to do this.

Answer 2

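Unfortunately, Django does not allow joins unless a foreign key (or one-to-one) is involved, so you will have to do this in code. I found a way (completely untested) to do it with essentially a single query (prefetch_related adds one extra query for the tags), which should improve the execution time significantly.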

from collections import Counter
from itertools import combinations

from django.db import models

# Assumed models
class Question(models.Model):
    ...

class Tag(models.Model):
    tag = models.CharField(..)
    question = models.ForeignKey(Question, related_name='tags')

c = Counter()
questions = Question.objects.all().prefetch_related('tags')  # prefetch the reverse foreign-key relation
for q in questions:
    # sort them so 'point' + 'curve' == 'curve' + 'point'
    tags = sorted(t.tag for t in q.tags.all())  # the field is named tag, not name
    c.update(combinations(tags, 2))  # get all 2-tag combinations and update the counter
c.most_common(5)  # show the top 5

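The code above uses Counter, itertools.combinations, and Django's prefetch_related, which should cover most of the bits that might be unfamiliar. If the code above does not work as-is, look at those resources and modify it accordingly.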
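For readers unfamiliar with Counter and combinations, here is a small standalone sketch on plain lists, with no Django involved. The tag lists are invented for illustration. It shows why the sort matters: without it, the same two tags can end up under two different keys.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag lists for three questions, in arbitrary stored order.
tags_per_question = [
    ["point", "curve"],
    ["curve", "point", "add"],
    ["subtract", "add"],
]

unsorted_counts = Counter()
sorted_counts = Counter()
for tags in tags_per_question:
    unsorted_counts.update(combinations(tags, 2))
    sorted_counts.update(combinations(sorted(tags), 2))

# Without sorting, the pair is split across two keys with one hit each.
print(unsorted_counts[("point", "curve")], unsorted_counts[("curve", "point")])
# With sorting, both questions land on the same key.
print(sorted_counts.most_common(1))
```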

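If you are not using an M2M field on your Question model, you can still access the tags through the reverse relation as if it were an M2M field. See my edit, which changes the reverse relation from tag_set to tags. I made a few other edits that should work with the way you have defined your models.

If you do not specify related_name='tags', just change tags in the filters and in prefetch_related to tag_set.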

Django Queryset: need help optimizing this set of queries

2 answers: