Partial matching with the GAE Search API

Date: 2012-10-15 15:38:08

标签: python google-app-engine search autocomplete webapp2

Can the GAE Search API be used to search for partial matches?

I am trying to create autocomplete functionality, where the term is a partial word. For example,

> b
> bui
> build

would all return "building".

How can this be achieved in GAE?

6 Answers:

Answer 0 (score: 31)

Although full-text search does not support LIKE statements (partial matching), you can hack around it.

First, tokenize the data string into all possible substrings (hello = h, he, hel, lo, etc.):

def tokenize_autocomplete(phrase):
    """Returns every contiguous substring of every word in the phrase."""
    a = []
    for word in phrase.split():
        j = 1
        while True:
            # Collect all substrings of length j, then grow j until it
            # spans the whole word.
            for i in range(len(word) - j + 1):
                a.append(word[i:i + j])
            if j == len(word):
                break
            j += 1
    return a
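
For example, a single short word yields every contiguous substring, grouped by length, so even mid-word fragments such as 'ell' become searchable tokens:

print(tokenize_autocomplete('hello'))
# ['h', 'e', 'l', 'l', 'o', 'he', 'el', 'll', 'lo', 'hel', 'ell',
#  'llo', 'hell', 'ello', 'hello']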

Build the index + document (Search API) using the tokenized string:

from google.appengine.api import search

index = search.Index(name='item_autocomplete')
for item in items:  # item is an ndb model instance
    # Join the tokens into one text value; the Search API splits it
    # back into individual tokens on punctuation and whitespace.
    name = ','.join(tokenize_autocomplete(item.name))
    document = search.Document(
        doc_id=item.key.urlsafe(),
        fields=[search.TextField(name='name', value=name)])
    index.put(document)

Perform the search, and voilà!

results = search.Index(name="item_autocomplete").search("name:elo")
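
To get from the matched documents back to the datastore entities, one option (a minimal sketch, reusing the doc_id set during indexing above) is to decode each doc_id back into an ndb key:

from google.appengine.ext import ndb

# doc_id was set to the entity's urlsafe key when indexing, so the
# matching entities can be fetched back in one batch.
keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
items = ndb.get_multi(keys)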

https://code.luasoftware.com/tutorials/google-app-engine/partial-search-on-gae-with-search-api/

Answer 1 (score: 3)

The same as @Desmond Lua's answer, but with a different tokenize function:

def tokenize(phrase):
  """Builds comma-joined prefixes (length >= 2) of each word."""
  tokens = []
  for word in phrase.split(' '):
    for i in range(1, len(word)):
      if i == 1:
        tokens.append(word[0] + word[i])
      else:
        # Extend the previous prefix by one character.
        tokens.append(tokens[-1] + word[i])
  return ",".join(tokens)

It will parse hello world into he,hel,hell,hello,wo,wor,worl,world.

It is good for lightweight autocomplete purposes.
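
A quick check of the claim above; note that the function returns a single comma-joined string, ready to use as the TextField value from answer 0:

print(tokenize('hello world'))
# he,hel,hell,hello,wo,wor,worl,world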

Answer 2 (score: 2)

As described in Full Text Search and LIKE statement, it is not possible, since the Search API implements full-text indexing.

Hope this helps!

Answer 3 (score: 0)

I had the same problem with a typeahead control; my solution was to parse the string into small parts:

name = 'hello world'
# Build every prefix of the whole phrase, from 2 characters up.
name_search = ' '.join([name[:i] for i in xrange(2, len(name) + 1)])
print name_search
# -> he hel hell hello hello  hello w hello wo hello wor hello worl hello world
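
To wire this into the Search API, the name_search value could be stored as the searchable field, along the lines of answer 0 (a sketch; the item entity and index name are assumed from that answer):

# 'item' and the index name are borrowed from answer 0 for illustration.
document = search.Document(
    doc_id=item.key.urlsafe(),
    fields=[search.TextField(name='name', value=name_search)])
search.Index(name='item_autocomplete').put(document)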

Hope this helps.

Answer 4 (score: 0)

My optimized version: no duplicate tokens.

def tokenization(text):
    """Builds prefix tokens (minimum length 3) with no duplicates."""
    a = []
    min_length = 3  # renamed so the built-in min() is not shadowed
    for word in text.split():
        if len(word) >= min_length:
            # + 1 so the full word itself is indexed as well.
            for i in range(min_length, len(word) + 1):
                token = word[0:i]
                if token not in a:
                    a.append(token)
    return a
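
For example:

print(tokenization('hello hello hey hi'))
# ['hel', 'hell', 'hello', 'hey']
# Repeated words add no new tokens, and words shorter than the
# minimum length ('hi') produce none at all.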

Answer 5 (score: 0)

Jumping in very late here.

But here is my well-documented function that does the tokenizing. The docstring should help you understand and use it well. Good luck!

def tokenize(string_to_tokenize, token_min_length=2):
  """Tokenizes a given string.

  Note: If a word in the string to tokenize is less then
  the minimum length of the token, then the word is added to the list
  of tokens and skipped from further processing.
  Avoids duplicate tokens by using a set to save the tokens.
  Example usage:
    tokens = tokenize('pack my box', 3)

  Args:
    string_to_tokenize: str, the string we need to tokenize.
    Example: 'pack my box'.
    min_length: int, the minimum length we want for a token.
    Example: 3.

  Returns:
    set, containng the tokenized strings. Example: set(['box', 'pac', 'my',
    'pack'])
  """
  tokens = set()
  token_min_length = token_min_length or 1
  for word in string_to_tokenize.split(' '):
    if len(word) <= token_min_length:
      tokens.add(word)
    else:
      for i in range(token_min_length, len(word) + 1):
        tokens.add(word[:i])
  return tokens
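
A quick check against the docstring's example (Python 2 set repr, matching the App Engine runtime of that era):

tokens = tokenize('pack my box', 3)
print(tokens)
# set(['box', 'pac', 'my', 'pack'])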