An efficient way to implement full-text search in Go

Time: 2019-04-06 18:46:46

Tags: go binary-search-tree trie

I am trying to implement a simple full-text search in Go, but all of my implementations are too slow to pass the time threshold.

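The task is as follows:

  • A document is a non-empty string of lowercase words separated by spaces

  • Each document has an implicit identifier equal to its index in the input array

  • New() builds the index

  • Search() accepts a query, which is also a string of lowercase words separated by spaces, and returns a sorted array of the unique identifiers of the documents that contain all the words from the query, regardless of word order

Example: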

index := New([]string{
"this is the house that jack built",  //: 0
"this is the rat that ate the malt",  //: 1
})

index.Search("")  // -> []
index.Search("in the house that jack built")  // -> []
index.Search("malt rat")  // -> [1]
index.Search("is this the")  // -> [0, 1]

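I have already tried implementing:

  • a binary search tree, both one per document and one for all documents

  • a trie (prefix tree), both one per document and one for all documents

  • an inverted index search

Binary search tree (one for all documents):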

type Tree struct {
    m           map[int]bool
    word        string
    left        *Tree
    right       *Tree
}

type Index struct {
    tree *Tree
}

Binary search tree (one tree per document):

type Tree struct {
    word  string
    left  *Tree
    right *Tree
}

type Index struct {
    tree  *Tree
    index int
    next  *Index
}

Trie (one for all documents):

type Trie struct {
    m        map[uint8]*Trie
    end_node map[int]bool
}

type Index struct {
    trie *Trie
}

Trie (one per document):

type Trie struct {
    m        map[uint8]*Trie
    end_node bool
}

type Index struct {
    trie  *Trie
    index int
    next  *Index
}

Inverted index:

type Index struct {
    m map[string]map[int]bool
}

New and Search implementations for the inverted index:

// New creates a fulltext search index for the given documents
func New(docs []string) *Index {
    m := make(map[string]map[int]bool)

    for i := 0; i < len(docs); i++ {
        words := strings.Fields(docs[i])
        for j := 0; j < len(words); j++ {
            if m[words[j]] == nil {
                m[words[j]] = make(map[int]bool)
            }
            m[words[j]][i+1] = true
        }
    }
    return &(Index{m})
}

// Search returns a slice of unique ids of documents that contain all words from the query.
func (idx *Index) Search(query string) []int {
    if query == "" {
        return []int{}
    }
    ret := make(map[int]bool)
    arr := strings.Fields(query)
    fl := 0
    for i := range arr {
        if idx.m[arr[i]] == nil {
            return []int{}
        }
        if fl == 0 {
            for value := range idx.m[arr[i]] {
                ret[value] = true
            }
            fl = 1
        } else {
            tmp := make(map[int]bool)
            for value := range ret {
                if idx.m[arr[i]][value] == true {
                    tmp[value] = true
                }
            }
            ret = tmp
        }
    }
    ret_arr := []int{}
    for value := range ret {
        ret_arr = append(ret_arr, value-1)
    }
    sort.Ints(ret_arr)
    return ret_arr
}
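
One possible variation on the inverted index above, offered as a sketch rather than a measured fix: store each posting list as a sorted []int instead of a map[int]bool, and intersect lists with a linear merge. Because documents are indexed in increasing id order, the lists come out sorted without a separate sort step, and Search needs no final sort.Ints call. The names PostingIndex, NewPostingIndex and intersect are hypothetical and not part of the original code, and whether this is fast enough for the threshold is untested.

```go
package main

import (
	"fmt"
	"strings"
)

// PostingIndex (hypothetical name) maps each word to a sorted,
// duplicate-free slice of the ids of the documents that contain it.
type PostingIndex struct {
	postings map[string][]int
}

// NewPostingIndex builds the index. Documents are visited in increasing
// id order, so every posting list ends up sorted without an explicit sort.
func NewPostingIndex(docs []string) *PostingIndex {
	postings := make(map[string][]int)
	for id, doc := range docs {
		for _, w := range strings.Fields(doc) {
			list := postings[w]
			// Skip repeats of a word within the same document.
			if n := len(list); n > 0 && list[n-1] == id {
				continue
			}
			postings[w] = append(list, id)
		}
	}
	return &PostingIndex{postings: postings}
}

// intersect merges two sorted slices and keeps the ids present in both.
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// Search returns the sorted ids of the documents containing every query word.
func (idx *PostingIndex) Search(query string) []int {
	words := strings.Fields(query)
	if len(words) == 0 {
		return []int{}
	}
	result, ok := idx.postings[words[0]]
	if !ok {
		return []int{}
	}
	for _, w := range words[1:] {
		list, ok := idx.postings[w]
		if !ok {
			return []int{}
		}
		result = intersect(result, list)
		if len(result) == 0 {
			return []int{}
		}
	}
	// Return a copy so callers cannot modify the index's own slices.
	return append([]int{}, result...)
}

func main() {
	idx := NewPostingIndex([]string{
		"this is the house that jack built",
		"this is the rat that ate the malt",
	})
	fmt.Println(idx.Search(""))
	fmt.Println(idx.Search("in the house that jack built"))
	fmt.Println(idx.Search("malt rat"))
	fmt.Println(idx.Search("is this the"))
}
```

Run against the two example documents, this prints [], [], [1] and [0 1], matching the expected results listed in the task.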

Am I doing something wrong? Or is there a better search algorithm for this in Go?

Thanks for your help.

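1 answer:

Answer 0: (score: 0)

I can't really help with the language-specific part, but in case it helps, here is some pseudocode describing a trie implementation and a function that solves the problem at hand fairly efficiently.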

struct TrieNode {
    map[char, TrieNode] children   // maps a character to the child node
    set[int] contains              // ids of all documents that contain the word ending at this node
}

// classic trie lookup, except that it returns a set of document ids instead of a simple boolean
function get_doc_ids(TrieNode node, string w, int depth) {
    if (depth == length(w)) {
        return node.contains
    } else {
        if (node.hasChild(w[depth])) {
            return get_doc_ids(node.getChild(w[depth]), w, depth+1)
        } else {
            return empty_set()
        }
    }
}

// the query function, as straightforward as it can be
function answer_query(TrieNode root, list_of_words L) {
    n = length(L)
    if (n == 0) {
        return empty_set()  // an empty query matches no documents
    }
    result = get_doc_ids(root, L[0], 0)
    for i from 1 to n-1 do {
        result = intersection(result, get_doc_ids(root, L[i], 0))  // set intersection
        if (result.is_empty()) {
            break  // no document contains all the words, no need to check further
        }
    }
    return result
}
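
For readers who want to try the pseudocode above in Go, here is a minimal sketch of one way to translate it. The names trieNode, insert, docIDs, buildTrie and answerQuery are hypothetical, the lookup is written as a loop rather than recursively, and an insert step and a final sort are added because the task requires building the index and returning sorted ids. It is not benchmarked, so no claim is made that it is faster than the plain map-based inverted index from the question.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trieNode mirrors TrieNode from the pseudocode: children keyed by byte,
// plus the set of ids of the documents containing the word that ends here.
type trieNode struct {
	children map[byte]*trieNode
	docs     map[int]struct{}
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[byte]*trieNode{}, docs: map[int]struct{}{}}
}

// insert records that the document with the given id contains word.
func (n *trieNode) insert(word string, id int) {
	for i := 0; i < len(word); i++ {
		child, ok := n.children[word[i]]
		if !ok {
			child = newTrieNode()
			n.children[word[i]] = child
		}
		n = child
	}
	n.docs[id] = struct{}{}
}

// docIDs returns the id set stored for word, or nil if the path is missing.
func (n *trieNode) docIDs(word string) map[int]struct{} {
	for i := 0; i < len(word); i++ {
		child, ok := n.children[word[i]]
		if !ok {
			return nil
		}
		n = child
	}
	return n.docs
}

// buildTrie inserts every word of every document into a single trie.
func buildTrie(docs []string) *trieNode {
	root := newTrieNode()
	for id, doc := range docs {
		for _, w := range strings.Fields(doc) {
			root.insert(w, id)
		}
	}
	return root
}

// answerQuery intersects the id sets of all query words and sorts the result.
func answerQuery(root *trieNode, query string) []int {
	words := strings.Fields(query)
	if len(words) == 0 {
		return []int{}
	}
	result := root.docIDs(words[0])
	for _, w := range words[1:] {
		if len(result) == 0 {
			break // no document contains all the words seen so far
		}
		next := root.docIDs(w)
		kept := make(map[int]struct{})
		for id := range result {
			if _, ok := next[id]; ok {
				kept[id] = struct{}{}
			}
		}
		result = kept
	}
	ids := make([]int, 0, len(result))
	for id := range result {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	root := buildTrie([]string{
		"this is the house that jack built",
		"this is the rat that ate the malt",
	})
	fmt.Println(answerQuery(root, "malt rat"))
	fmt.Println(answerQuery(root, "is this the"))
}
```

With the two example documents from the question, the two calls in main print [1] and [0 1].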