Data structure for fast search over sets. Input: tags, output: sentences

Time: 2018-09-02 20:50:41

Tags: python database search set

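I have the following problem.

I get 1-10 tags related to an image, and each tag comes with a probability that it is present in the image.

Input: beach, woman, dog, tree...

I want to retrieve from a database the already-composed sentences that are most relevant to the tags.

For example:

beach -> "Fun at the beach" / "Relaxing at the beach" ...

beach, woman -> "A woman on the beach"

beach, woman, dog -> not found!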

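In that case, take the closest match that exists, but take the probabilities into account. Say: woman 0.95, beach 0.85, dog 0.7. Then, if it exists, take woman + beach (0.95, 0.85) first, then woman + dog, and lastly beach + dog. Higher probabilities come first, but we do not sum them.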

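I thought about using a Python set, but I'm not sure how.

Another option is a defaultdict:

db['beach']['woman']['dog'], but I also want to get the same result from: db['woman']['beach']['dog']

I'd like to find a good solution. Thanks.

Edit: a working solution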

from collections import OrderedDict
list_of_keys = []
sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs','dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']


def keys_to_list_of_sets(dict_):
    list_of_keys = []
    for key in dict_:
        list_of_keys.append(set(key))

    return list_of_keys

def match_best_sentence(image_tags):
    for i, tags in enumerate(list_of_keys):
        if (tags & image_tags) == tags:
            print(list(sentences.keys())[i])

list_of_keys = keys_to_list_of_sets(sentences)
tags = set(['beach', 'dogs', 'woman'])
match_best_sentence(tags)

Result:

('dogs',)
('dogs', 'beach')
('woman', 'beach')

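This solution iterates over every key of the ordered dict, so it is O(n). I'd like to see any performance improvement.

2 answers:

Answer 0: (score: 1)

Without using a DB, the simplest approach is to keep one set per word and take intersections.

More explicitly:

If a sentence contains the word "woman", put it into the "woman" set. Do the same for dog, beach, and so on, for every sentence. This means your space complexity is O(sentences * average_tags), because each sentence is repeated in the data structure once per tag.

You might have: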

>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}

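You can build this into an object on top of a defaultdict, so that you can pass in a list of tags, intersect only the sets for those tags, and return the result.

Rough implementation idea: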

from collections import defaultdict
class myObj(object): #python2
    def __init__(self):
        # maps each tag to the set of sentences carrying that tag
        self.sets = defaultdict(lambda: set())

    def add_sentence(self, sentence, tags):
        #how you process tags is up to you, they could also be parsed from
        #the input string.
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        #assumes tags is a non-empty list
        result = self.sets[tags[0]] #this is a hack
        for t in tags[1:]:
            result = result.intersection(self.sets[t])

        return result #this function can stand to be improved but the idea is there

Maybe this makes it clearer how the defaultdict and the sets end up looking inside the object.

>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"
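For readers on Python 3, here is a minimal, self-contained sketch of the same per-tag set idea with a short usage run. The names `build_index` and `match_all` are hypothetical helpers made up for this illustration, and the sample sentences are taken from the question; treat it as one possible way to wire it up, not as the definitive implementation.

```python
from collections import defaultdict

def build_index(tagged_sentences):
    """Map each tag to the set of sentences carrying that tag."""
    index = defaultdict(set)
    for sentence, tags in tagged_sentences:
        for tag in tags:
            index[tag].add(sentence)
    return index

def match_all(index, tags):
    """Return the sentences that carry every tag in `tags`."""
    # .get() avoids creating empty entries for unknown tags
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()

index = build_index([
    ("I like dogs", ["dogs"]),
    ("the dog is at the beach", ["dogs", "beach"]),
    ("The woman was at the beach", ["woman", "beach"]),
])
print(match_all(index, ["dogs", "beach"]))           # one sentence matches
print(match_all(index, ["dogs", "beach", "woman"]))  # nothing carries all three
```

Note that an empty intersection only tells you that no sentence has all the tags; picking the "closest" smaller combination still needs a separate fallback step.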

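Answer 1: (score: 0)

It mostly depends on the size of the database and on the number of combinations between keywords. It also depends on which operation you perform most often.
If the database is small and you need a fast find operation, one possibility is a dictionary with frozensets as keys, where each key holds the tags and the value holds all associated sentences.

For example,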

from collections import defaultdict

d=defaultdict(list)
# preprocessing
d[frozenset(["bob","car","red"])].append("Bob owns a red car")

# searching
d[frozenset(["bob","car","red"])]  #['Bob owns a red car']
d[frozenset(["red","car","bob"])]  #['Bob owns a red car']

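For partial word combinations such as "bob", "car", you have different possibilities, depending on the number of keywords and on what matters more to you. For example:

  • You could store additional entries for each combination
  • You could iterate over the keys and check which ones contain both car and bob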
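A third possibility, sketched below under stated assumptions, is to go the other way round: enumerate the combinations of the query tags and look each one up in the frozenset-keyed dictionary. With at most 10 tags that is at most 1023 lookups, regardless of how many sentences are stored. The function name `best_match` and the sample data are hypothetical; the ordering (largest combination first, then most probable tags first, without summing) is my reading of the question.

```python
from itertools import combinations

# Hypothetical data: frozenset of tags -> sentences stored for exactly that tag set
sentences = {
    frozenset(["beach"]): ["Fun at the beach", "Relaxing at the beach"],
    frozenset(["woman", "beach"]): ["A woman on the beach"],
    frozenset(["dog"]): ["I like dogs"],
}

def best_match(tag_probs, db):
    """Return (tags, sentences) for the first stored combination found, else None."""
    # Most probable tags first; combinations() keeps this input order, so
    # woman+beach is tried before woman+dog, which is tried before beach+dog.
    ranked = sorted(tag_probs, key=tag_probs.get, reverse=True)
    for size in range(len(ranked), 0, -1):  # larger combinations first
        for combo in combinations(ranked, size):
            found = db.get(frozenset(combo))
            if found:
                return combo, found
    return None

result = best_match({"woman": 0.95, "beach": 0.85, "dog": 0.7}, sentences)
print(result)
```

Here the three-tag combination is not stored, so the search falls back to the pairs and stops at woman + beach.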