Question

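I have the following problem.

I get 1-10 tags related to an image, each with a probability of being present in the image.

Input: beach, woman, dog, tree ...

I would like to retrieve from a database the already-composed sentence that is most related to the tags.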

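For example:

beach -> "fun at the beach" / "chilling at the beach" ....

beach, woman -> "woman at the beach"

beach, woman, dog -> none found!

Take the closest match that exists, but take the probabilities into account. Say: woman 0.95, beach 0.85, dog 0.7. So, if it exists, take woman+beach (0.95, 0.85) first, then woman+dog, and last beach+dog. The order is higher-is-better, but we do not sum the probabilities.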
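
The ordering described above can be sketched as follows. This is only an illustration, assuming the tag probabilities are distinct; `ranked_pairs` is a hypothetical helper name, not part of any library. Sorting the tags by probability first means `itertools.combinations` already yields the pairs in the desired order, without summing anything.

```python
from itertools import combinations

def ranked_pairs(tag_probs):
    """Return tag pairs, most probable pair first (hypothetical helper)."""
    # Sort the tags by descending probability.
    ordered = sorted(tag_probs, key=tag_probs.get, reverse=True)
    # combinations() preserves input order, so the pairs come out ranked
    # by their (higher, lower) probabilities, compared lexicographically.
    return list(combinations(ordered, 2))

pairs = ranked_pairs({'woman': 0.95, 'beach': 0.85, 'dog': 0.7})
print(pairs)  # [('woman', 'beach'), ('woman', 'dog'), ('beach', 'dog')]
```

Each pair could then be looked up in turn until one with stored sentences is found.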

I thought about using Python sets, but I am not sure how.

Another option is a defaultdict:

db['beach']['woman']['dog'], but I also want to get the same result from db['woman']['beach']['dog'].

I would like a good solution. Thanks.

Edit: a working solution

from collections import OrderedDict
list_of_keys = []
sentences = OrderedDict()
sentences[('dogs',)] = ['I like dogs','dogs are man best friends!']
sentences[('dogs', 'beach')] = ['the dog is at the beach']
sentences[('woman', 'cafe')] = ['The woman sat at the cafe.']
sentences[('woman', 'beach')] = ['The woman was at the beach']
sentences[('dress',)] = ['hi nice dress', 'what a nice dress !']


def keys_to_list_of_sets(dict_):
    # Convert every tuple key into a set so it can be compared with the image tags.
    list_of_keys = []
    for key in dict_:
        list_of_keys.append(set(key))

    return list_of_keys

def match_best_sentence(image_tags):
    # Print every key whose tags are all present in image_tags.
    for i, tags in enumerate(list_of_keys):
        if (tags & image_tags) == tags:  # equivalent to tags <= image_tags
            print(list(sentences.keys())[i])

list_of_keys = keys_to_list_of_sets(sentences)
tags = set(['beach', 'dogs', 'woman'])
match_best_sentence(tags)

Result:

('dogs',)
('dogs', 'beach')
('woman', 'beach')

This solution runs over all the keys of the ordered dictionary, O(n). I would like to see any performance improvement.

Answer 1

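Without using a DB, the simplest approach is to keep a set of sentences for each word and take intersections.

More explicitly:

If a sentence contains the word "woman", put it in the "woman" set. Do the same for dog, beach, and so on, for every sentence. This means your space complexity is O(sentences * average_tags), since each sentence is repeated in the data structure once per tag.

You might have: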

>>> dogs = set(["I like dogs", "the dog is at the beach"])
>>> woman = set(["The woman sat at the cafe.", "The woman was at the beach"])
>>> beach = set(["the dog is at the beach", "The woman was at the beach", "I do not like the beach"])
>>> dogs.intersection(beach)
{'the dog is at the beach'}

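You can build this into an object on top of a defaultdict, so that it takes a list of tags, intersects only the sets for those tags, and returns the result.

Rough implementation idea: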

from collections import defaultdict
class myObj(object): #python2
    def __init__(self):
        self.sets = defaultdict(lambda: set())

    def add_sentence(self, sentence, tags):
        #how you process tags is up to you, they could also be parsed from
        #the input string.
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        result = self.sets[tags[0]] #this is a hack
        for t in tags[1:]:
            result = result.intersection(self.sets[t])

        return result #this function can stand to be improved but the idea is there
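
A self-contained variant of the same idea, with a short usage example, might look like the sketch below. The class name `TagIndex` and the sample sentences are assumptions made for illustration. It also handles an empty tag list and unknown tags, which the rough version does not.

```python
from collections import defaultdict

class TagIndex:
    """Hypothetical tag -> sentences index, for illustration only."""

    def __init__(self):
        self.sets = defaultdict(set)

    def add_sentence(self, sentence, tags):
        # Register the sentence under every one of its tags.
        for t in tags:
            self.sets[t].add(sentence)

    def get_match(self, tags):
        if not tags:
            return set()
        # .get() avoids creating empty entries for unknown tags.
        sets = [self.sets.get(t, set()) for t in tags]
        return set.intersection(*sets)

index = TagIndex()
index.add_sentence("the dog is at the beach", ["dogs", "beach"])
index.add_sentence("The woman was at the beach", ["woman", "beach"])
index.add_sentence("I like dogs", ["dogs"])

print(index.get_match(["dogs", "beach"]))  # {'the dog is at the beach'}
print(index.get_match(["dogs", "woman"]))  # set()
```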

Maybe this will make it clearer how the defaultdict and the sets end up looking inside the object.

>>> a = defaultdict(lambda: set())
>>> a['woman']
set([])
>>> a['woman'].add(1)
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1])})"
>>> a['beach'].update([1,2,3,4])
>>> a['woman'].intersection(a['beach'])
set([1])
>>> str(a)
"defaultdict(<function <lambda> at 0x7fcb3bbf4b90>, {'woman': set([1]), 'beach': set([1, 2, 3, 4])})"

Answer 2

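It mainly depends on the size of your database and on the number of combinations between keywords. It also depends on which operation you perform most often.
If it is small and you need a fast find operation, one possibility is to use a dictionary with frozensets as keys, where each key contains the tags and the value is the list of all associated sentences.

For example,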

from collections import defaultdict

d=defaultdict(list)
# preprocessing
d[frozenset(["bob","car","red"])].append("Bob owns a red car")

# searching
d[frozenset(["bob","car","red"])]  #['Bob owns a red car']
d[frozenset(["red","car","bob"])]  #['Bob owns a red car']

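For a combination of words such as "bob", "car", you have different possibilities depending on the number of keywords and on what matters more to you. For example:

you could store an additional entry for each combination
you could iterate over the keys and check which ones contain both car and bob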
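
The second possibility, scanning the keys, can be sketched as follows. It assumes a frozenset-keyed dictionary like the one above; the extra sample sentences and the helper name `find_containing` are made up for illustration. The scan is linear in the number of keys, so it trades lookup speed for a smaller dictionary.

```python
from collections import defaultdict

d = defaultdict(list)
d[frozenset(["bob", "car", "red"])].append("Bob owns a red car")
d[frozenset(["bob", "car"])].append("Bob drives a car")
d[frozenset(["alice", "car"])].append("Alice sold her car")

def find_containing(db, tags):
    """Return the sentences whose key contains every tag in `tags`."""
    wanted = frozenset(tags)
    found = []
    for key, sentences in db.items():
        if wanted <= key:  # subset test: all wanted tags are in the key
            found.extend(sentences)
    return found

result = find_containing(d, ["car", "bob"])
print(sorted(result))  # ['Bob drives a car', 'Bob owns a red car']
```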

Data structure for fast lookup of sets. Input: tags, output: sentences
