Question

Let's say we have 1 million rows like this:

```
import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')
```

Background:

I know how to do this with Sqlite:

Find a row with a single-word query, tolerating up to a few spelling mistakes, using the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):
```
db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print(c.fetchall())

#Query: 'Riehmand'
#Answer: [(1, u'Riemann')]
```
With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization for this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:
```
c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print(c.fetchall())

#Query: 'Riehmand'
#Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!
```
Find an expression with a query matching one or multiple words, using FTS ("Full Text Search"):
```
c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print(c.fetchall())

#Query: 'NUMBERS carmichael'
#Answer: [(2, u'All the Carmichael numbers')]
```
It is case insensitive, and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. The drawback is that each query keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.

Question:

How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together.

Example:

Row in the DB: "All the Carmichael numbers"
Query: "NUMMBER carmickaeel" should match it!
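The limitation described above can be reproduced without the spellfix extension. The following is a minimal sketch, assuming the Python `sqlite3` module is linked against an SQLite build with FTS4 enabled; the table name `docs` and the helper `fts_search` are made up for this illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
c = db.cursor()

# Hypothetical single-column FTS4 table holding the example row.
c.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
c.execute("INSERT INTO docs VALUES ('All the Carmichael numbers')")

def fts_search(query):
    # Plain FTS4 lookup: every query term has to match an indexed token.
    return c.execute(
        "SELECT description FROM docs WHERE description MATCH ?", (query,)
    ).fetchall()

# Case and word order do not matter when the words are spelled correctly.
well_spelled = fts_search("NUMBERS carmichael")
# The misspelled query from the example finds nothing with FTS alone.
misspelled = fts_search("NUMMBER carmickaeel")

print(well_spelled)
print(misspelled)
```

The first search returns the row and the second returns an empty list, which is the gap that combining FTS with spellfix is meant to close.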

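How to do this with Sqlite?

It is probably possible with Sqlite, since this page states:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

Linked question: String similarity with Python + Sqlite (Levenshtein distance / edit distance)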

Answer 1

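The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:
```
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
```

The SELECT term from search_aux WHERE col='*' statement extracts all the indexed tokens.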

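Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

```
CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';
```

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:

```
INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND
        term not in (SELECT word from mytable3_vocab);
```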

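Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.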

Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:

```
def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    # One lookup per term; mytable3 is the spellfix1 table populated above.
    base_spellfix = """
        SELECT :term{0} as term, word FROM mytable3
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    # Combine the per-term lookups into a single UNION query.
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    # Rows are (original term, corrected word) pairs.
    correction_map = dict(cursor)
    # Terms without a suggestion are kept as typed.
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()
```

This then returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").

Handling spellchecking in Python then lets you support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do that.

The above approach can be wrapped in a class that simply extracts terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices).
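As a rough illustration of the class-based idea, and not the answer's own class, the sketch below rewrites every alphanumeric run in a query while leaving quotes, parentheses and operators untouched. The names `QueryCorrector`, `spellfix_lookup` and `fake_corrections` are hypothetical, and the operator set (AND, OR, NOT, NEAR) is an assumption about which keywords the query syntax reserves.

```python
import re

# Assumption: uppercase keywords that should never be "corrected".
FTS_OPERATORS = {"AND", "OR", "NOT", "NEAR"}
TERM = re.compile(r"[A-Za-z0-9]+")

class QueryCorrector:
    """Hypothetical helper: correct each alphanumeric term of an FTS query."""

    def __init__(self, lookup):
        # lookup(term) returns a corrected term, or None if it has no suggestion.
        self.lookup = lookup

    def _replace(self, match):
        term = match.group(0)
        if term in FTS_OPERATORS:
            return term
        return self.lookup(term) or term

    def correct(self, query):
        # Only the alphanumeric runs are rewritten; punctuation is preserved.
        return TERM.sub(self._replace, query)

def spellfix_lookup(conn):
    # Lookup backed by the spellfix1 table from the answer (mytable3).
    # Not exercised below because it needs the spellfix extension loaded.
    def lookup(term):
        row = conn.execute(
            "SELECT word FROM mytable3 WHERE word MATCH ? AND top=1", (term,)
        ).fetchone()
        return row[0] if row else None
    return lookup

# Dictionary stand-in for the spellfix1 lookup, for demonstration only.
fake_corrections = {"NUMMBER": "numbers", "carmickaeel": "carmichael"}
corrector = QueryCorrector(fake_corrections.get)
corrected = corrector.correct('"NUMMBER carmickaeel" OR riemann')
print(corrected)
```

With a real connection, `QueryCorrector(spellfix_lookup(conn))` would route each term through spellfix1 instead of the dictionary. Column filters such as `description:word` would also have their column name looked up, so a fuller version would need to special-case them.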

Answer 2

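The accepted answer is good (full credit to him). Here is a slight variation that, although not as complete as the accepted one for complex cases, helps to grasp the idea: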

```
import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")   # populate the table
c.execute("INSERT INTO mytable2 VALUES ('They are great')")
c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
c.execute("INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*'")

def search(query):
    # Correcting each query term with spellfix table
    correctedquery = []
    for t in query.split():
        spellfix_query = "SELECT word FROM mytable3 WHERE word MATCH ? and top=1"
        c.execute(spellfix_query, (t,))
        r = c.fetchone()
        # Correct the word if there is any match in the spellfix table;
        # if no match, keep the word as typed (the search will then give no result!)
        correctedquery.append(r[0] if r is not None else t)

    correctedquery = ' '.join(correctedquery)

    # Now do the FTS
    fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
    c.execute(fts_query, (correctedquery,))
    return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

print(search('NUMBBERS carmickaeel'))
print(search('some HERE'))
print(search('some qsdhiuhsd'))
```

Here is the result:

{'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
{'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
{'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

Remark: note that the "Correcting each query term with spellfix table" part runs one SQL query per term. The performance of this approach, compared with a single UNION SQL query, is studied here.

Sqlite with real "Full Text Search" and spelling mistakes (FTS + spellfix together)
