Question

I want to retrieve all searchable documents from CloudSearch.

I tried a negative search:

search-[mySearchEndPoint].cloudsearch.amazonaws.com/2011-02-01/search?bq=(not keywords: '!!!testtest!!!')

It works, but it also returns all of the deleted documents.

So how can I get only the active documents?

Answer 1

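The key thing to know is that CloudSearch doesn't really delete. Instead, the "delete" function keeps the ID in the index but clears all fields of the deleted document, which includes setting uint fields to 0. This works fine for positive queries, which will not match any text in the cleared, "deleted" documents.

A workaround is to add a uint field to your documents, called 'updated' below, and use it as a filter for queries that may return deleted IDs, such as negative queries.

(The examples below use the Boto interface library for CloudSearch, with many steps omitted for brevity.)

When you add a document, set the field to the current timestamp: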

import time

now_utc = int(time.time())  # unix time in seconds; useful for 'version' also.
doc['updated'] = now_utc
doc_service.add(id, now_utc, doc)
conn.commit()

When you delete, CloudSearch sets uint fields to 0:

doc_service.delete(id, now_utc)
conn.commit()
# CloudSearch sets doc's 'updated' field = 0
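
To make the behavior described above concrete, here is a minimal sketch of a toy in-memory index. It is only a model of what this answer describes (the ID stays, text fields are cleared, uint fields become 0) and is not the CloudSearch or Boto API; the class `ToyIndex` and its methods are hypothetical names made up for illustration.

```python
# Toy model of the delete behavior described in this answer.
# NOT the CloudSearch API: ToyIndex and its methods are hypothetical.
class ToyIndex:
    def __init__(self):
        self.docs = {}  # document ID -> dict of fields

    def add(self, doc_id, fields):
        self.docs[doc_id] = dict(fields)

    def delete(self, doc_id):
        # Keep the ID, but blank text fields and zero integer fields.
        old = self.docs[doc_id]
        self.docs[doc_id] = {
            name: 0 if isinstance(value, int) else ""
            for name, value in old.items()
        }

    def not_title(self, term):
        # Negative query: every ID whose title does not contain the term.
        return sorted(i for i, d in self.docs.items() if term not in d["title"])

    def active_not_title(self, term):
        # Same negative query, filtered to documents with updated >= 1.
        return sorted(
            i for i, d in self.docs.items()
            if d["updated"] >= 1 and term not in d["title"]
        )


index = ToyIndex()
index.add("a", {"title": "hello world", "updated": 1700000000})
index.add("b", {"title": "foobar notes", "updated": 1700000001})
index.add("c", {"title": "old draft", "updated": 1700000002})
index.delete("c")

all_hits = index.not_title("foobar")            # includes the deleted ID "c"
active_hits = index.active_not_title("foobar")  # only the active ID "a"
print(all_hits, active_hits)
```

In this model the cleared title of "c" trivially satisfies the negative match, which is why the extra `updated` filter is needed.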

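Now you can distinguish deleted from active documents in a negative query. The examples below search a test index of 86 documents, about half of which have been deleted.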

# negative query that shows both active and deleted IDs
neg_query = "title:'-foobar'"
results = search_service.search(bq=neg_query)
results.hits  # 86 docs in a test index

# deleted items
deleted_query = "updated:0"
results = search_service.search(bq=deleted_query)
results.hits  # 46 of them have been deleted

# negative, filtered query that lists only active
filtered_query = "(and updated:1.. title:'-foobar')"
results = search_service.search(bq=filtered_query)
results.hits  # 40 active docs
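
If several queries need the same filter, the wrapping step can be factored out. The following is a small sketch, assuming the 2011-02-01 boolean query syntax used above; the helper name `only_active` is hypothetical and not part of Boto.

```python
def only_active(clause, field="updated"):
    """Wrap a boolean-query clause so that it only matches documents
    whose uint `field` is 1 or greater (i.e. not cleared by a delete)."""
    return "(and {}:1.. {})".format(field, clause)


filtered_query = only_active("title:'-foobar'")
print(filtered_query)  # (and updated:1.. title:'-foobar')
```

The resulting string can be passed as `bq` exactly like `filtered_query` in the example above.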

Answer 2

I think you can do it like this:

search-[mySearchEndPoint].cloudsearch.amazonaws.com/2011-02-01/search?bq=-impossibleTermToSearch

Note the '-' at the beginning of the term.
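
When building such a URL in code, the `bq` value should be URL-encoded, which matters once the query contains spaces, quotes, or parentheses. Below is a minimal sketch using only the Python standard library; the endpoint host, the `https` scheme, and the helper name `search_url` are assumptions for illustration, so substitute your own search endpoint.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical placeholder; replace with your own search endpoint host.
ENDPOINT = "search-example.cloudsearch.amazonaws.com"


def search_url(endpoint, bq):
    """Build a 2011-02-01 search URL with the bq parameter URL-encoded."""
    return "https://{}/2011-02-01/search?{}".format(endpoint, urlencode({"bq": bq}))


url = search_url(ENDPOINT, "-impossibleTermToSearch")
print(url)

# A query with spaces and quotes survives an encode/decode round trip.
filtered = search_url(ENDPOINT, "(and updated:1.. title:'-foobar')")
decoded = parse_qs(filtered.split("?", 1)[1])["bq"][0]
print(decoded)
```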
