I have two data files I'm working with. One contains a word list with some additional information about those words; the other contains word pairs (where the words are given by their word IDs from the first table) together with their frequencies.
Lexicon file (sample output)
('wID', 'w1', 'w1cs', 'L1', 'c1')
('-----', '-----', '-----', '-----', '-----')
(1, ',', ',', ',', 'y')
(2, '.', '.', '.', 'y')
(3, 'the', 'the', 'the', 'at')
(4, 'and', 'and', 'and', 'cc')
(5, 'of', 'of', 'of', 'io')
Bigram file (sample output)
('freq', 'w1', 'w2')
(4, 22097, 161)
(1, 98664, 1320)
(1, 426515, 1345)
(1, 483675, 747)
(19, 63, 15496)
(2, 3011, 7944)
(1, 27985, 27778)
I created two tables with SQLite and loaded the data from the files above.
import sqlite3

conn = sqlite3.connect('bigrams.db')
conn.text_factory = str
c = conn.cursor()
c.execute('pragma foreign_keys=ON')
Lexicon table
c.execute('''CREATE TABLE lex
(wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT)''')
#I removed this index as per CL.'s suggestion
#c.execute('''DROP INDEX IF EXISTS lex_index''')
#c.execute('''CREATE INDEX lex_index ON lex (wID, w1, c1)''')
#and added this one
c.execute('''CREATE INDEX lex_w1_index ON lex (w1)''')
Inserting data into the lexicon table
#I replaced this code
# with open('/Users/.../lexicon.txt', "rb") as lex_file:
#     for line in lex_file:
#         currentRow = line.split('\t')
#         try:
#             data = [currentRow[0], currentRow[1], currentRow[2], currentRow[3], str(currentRow[4].strip('\r\n'))]
#             c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', (data,))
#         except IndexError:
#             pass
#with the one that Julian wrote
blocksize = 100000
with open('/Users/.../lexicon.txt', "rb") as lex_file:
    data = []
    line_counter = 0
    for line in lex_file:
        data.append(line.strip().split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'Lex error lines {}-{}'.format(block_start, line_counter)
            finally:
                data = []
Bigram table
#I replaced this code to create table x2
#c.execute('''CREATE TABLE x2
# (freq INT, w1 INT, w2 INT, FOREIGN KEY(w1) REFERENCES lex(wID), FOREIGN KEY(w2) REFERENCES lex(wID))''')
#with the code that Julian suggested
c.execute('''CREATE TABLE x2
(freq INT, w1 INT, w2 INT,
FOREIGN KEY(w1) REFERENCES lex(wID),
FOREIGN KEY(w2) REFERENCES lex(wID),
PRIMARY KEY(w1, w2) )''')
Inserting data into the bigram table
#Replaced this code
#with open('/Users/.../x2.txt', "rb") as x2_file:
#    for line in x2_file:
#        currentRow = line.split('\t')
#        try:
#            data = [str(currentRow[0].replace('\x00','').replace('\xff\xfe','')), str(currentRow[1].replace('\x00','')), str(currentRow[2].replace('\x00','').strip('\r\n'))]
#            c.executemany('insert or replace into x2 values (?, ?, ?)', (data,))
#        except IndexError:
#            pass
#with this one suggested by Julian
with open('/Users/.../x2.txt', "rb") as x2_file:
    data = []
    line_counter = 0
    for line in x2_file:
        data.append(line.strip().replace('\x00','').replace('\xff\xfe','').split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into x2 values (?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'x2 error lines {}-{}'.format(block_start, line_counter)
            finally:
                data = []
conn.close()
I want to be able to check whether a given word pair exists in the data, e.g. "like new". When I specify only the first word, the program runs fine.
cur.execute('''SELECT lex1.w1, lex2.w1 from x2
               INNER JOIN lex as lex1 ON lex1.wID=x2.w1
               INNER JOIN lex as lex2 ON lex2.wID=x2.w2
               WHERE lex1.w1="like" ''')
But when I want to search for a pair of words, the query is extremely slow.
cur.execute('''SELECT lex1.w1, lex2.w1 from x2
               INNER JOIN lex as lex1 ON lex1.wID=x2.w1
               INNER JOIN lex as lex2 ON lex2.wID=x2.w2
               WHERE lex1.w1="like" AND lex2.w1="new" ''')
I can't figure out what I'm doing wrong. Any help would be much appreciated.
Answer 0 (score: 3)
EXPLAIN QUERY PLAN shows that the database scans the x2 table first, and then looks up the corresponding lex row for each x2 row to check whether the word matches. The lex lookups are done with a temporary index, but doing this lookup twice for every row in x2 still makes the whole query slow.
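You can see this for yourself by prefixing the slow query with EXPLAIN QUERY PLAN; a minimal sketch from Python (the exact plan text varies between SQLite versions):

import sqlite3

conn = sqlite3.connect('bigrams.db')
cur = conn.cursor()
# Ask SQLite how it intends to execute the slow two-word query.
cur.execute('''EXPLAIN QUERY PLAN
               SELECT lex1.w1, lex2.w1 FROM x2
               INNER JOIN lex AS lex1 ON lex1.wID = x2.w1
               INNER JOIN lex AS lex2 ON lex2.wID = x2.w2
               WHERE lex1.w1 = 'like' AND lex2.w1 = 'new' ''')
for row in cur.fetchall():
    print(row)  # look for SCAN vs. SEARCH and any automatic (temporary) index
conn.close()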
The query would be fast if the database could first look up the IDs of the two words, and then search x2 for a row with those two IDs. This requires some new indexes. (The lex_index index is useful only for lookups that start with the wID column, and such lookups can probably already use the primary key index.)
You need to create an index that allows searching on w1:
CREATE INDEX lex_w1_index ON lex(w1);
To find any x2 row with both word IDs, you need an index with these two columns in the leftmost positions:
CREATE INDEX x2_w1_w2_index ON x2(w1, w2);
Alternatively, make these two columns the primary key (see Julian's answer).
To force the database to do the word ID lookups first, you could move them into subqueries:
SELECT freq
FROM x2
WHERE w1 = (SELECT wID FROM lex WHERE w1 = 'like')
AND w2 = (SELECT wID FROM lex WHERE w1 = 'new')
However, this should not be necessary; with the new indexes, the optimizer should be able to find the optimal query plan automatically. (But you can still use this query if you find it more readable.)
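For what it's worth, from Python you would normally run this lookup with placeholders instead of pasting the words into the SQL; a sketch against the tables defined in the question:

import sqlite3

conn = sqlite3.connect('bigrams.db')
cur = conn.cursor()
# Look up both word IDs via subqueries, then search x2 by the ID pair.
cur.execute('''SELECT freq
               FROM x2
               WHERE w1 = (SELECT wID FROM lex WHERE w1 = ?)
                 AND w2 = (SELECT wID FROM lex WHERE w1 = ?)''',
            ('like', 'new'))
print(cur.fetchone())  # None if the pair does not occur in the data
conn.close()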
Answer 1 (score: 2)
Define your x2 table like this:
c.execute('''CREATE TABLE x2
(freq INT, w1 INT, w2 INT,
FOREIGN KEY(w1) REFERENCES lex(wID),
FOREIGN KEY(w2) REFERENCES lex(wID),
PRIMARY KEY(w1, w2) )''')
Aside from being semantically correct, this also creates a permanent index, which speeds up the query considerably. Without declaring the (w1, w2) pair as the table's primary key, that index has to be recreated temporarily every time the query runs, which is a costly operation.
The following code can be used to redefine the table without re-importing everything:
c.execute('''
create table x2_new (
freq INT, w1 INT, w2 INT,
FOREIGN KEY(w1) REFERENCES lex(wID),
FOREIGN KEY(w2) REFERENCES lex(wID),
PRIMARY KEY(w1, w2) )
''')
c.execute('insert into x2_new select * from x2')
c.execute('drop table x2')
c.execute('alter table x2_new rename to x2')
conn.commit()
The following code should speed up the inserts:
blocksize = 100000
with open('/Users/.../lexicon.txt', "rb") as lex_file:
    data = []
    line_counter = 0
    for line in lex_file:
        data.append(line.strip().split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'Lex error lines {}-{}'.format(block_start, line_counter)
                conn.rollback()
            finally:
                data = []

with open('/Users/.../x2.txt', "rb") as x2_file:
    data = []
    line_counter = 0
    for line in x2_file:
        data.append(line.strip().replace('\x00','').replace('\xff\xfe','').split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into x2 values (?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'x2 error lines {}-{}'.format(block_start, line_counter)
                conn.rollback()
            finally:
                data = []
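One caveat: since the files will rarely contain an exact multiple of 100,000 lines, whatever is left in data when a loop ends never gets inserted. A final flush after each loop takes care of the remainder; a sketch for the x2 loop (the lex loop would be analogous):

# Flush the trailing partial block left over after the loop.
if data:
    try:
        c.executemany('insert or replace into x2 values (?, ?, ?)', data)
        conn.commit()
    except IndexError:
        print 'x2 error in final block'
        conn.rollback()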
Answer 2 (score: 2)
If finding the rows for one of the words really is fast, you could create a temporary table from that result and then search within it. For example:
DROP TABLE IF EXISTS x2_temp;
CREATE TABLE x2_temp AS
SELECT lex.*, x2.w2 from x2
INNER JOIN lex ON lex.wID=x2.w1
WHERE lex.w1 = 'like';
SELECT x2_temp.*, lex.* from x2_temp
INNER JOIN lex ON lex.wID=x2_temp.w2
WHERE lex.w1 = 'new';
You could also combine the two without the temporary table (not sure whether that helps):
SELECT x.*, lex.* FROM
(SELECT lex.*, x2.w2 FROM x2
INNER JOIN lex ON lex.wID=x2.w1
WHERE lex.w1 = 'like') AS x
INNER JOIN lex ON lex.wID=x.w2
WHERE lex.w1 = 'new';
(These run in sqlite3, but I don't have your data and didn't take the time to create test data, so they are untested; they should be correct, though.)
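If you want to drive the temporary-table variant from Python, something along these lines should work (a sketch; executescript runs the multi-statement part in one call):

import sqlite3

conn = sqlite3.connect('bigrams.db')
cur = conn.cursor()
# Build the intermediate table for the first word in one script...
cur.executescript('''
    DROP TABLE IF EXISTS x2_temp;
    CREATE TABLE x2_temp AS
    SELECT lex.*, x2.w2 FROM x2
    INNER JOIN lex ON lex.wID = x2.w1
    WHERE lex.w1 = 'like';
''')
# ...then filter it by the second word.
cur.execute('''SELECT x2_temp.*, lex.* FROM x2_temp
               INNER JOIN lex ON lex.wID = x2_temp.w2
               WHERE lex.w1 = ?''', ('new',))
print(cur.fetchall())
conn.close()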