Question

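I am trying to build embeddings for a question answering (QA) system. The function below works well for question embeddings but has a problem with sentence embeddings. For each item in the dataframe, train['sent_emb'] should hold the item's embedding whenever the item is found in dict_emb. For the sentence embeddings, however, none of the items are found: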

train['sent_emb'] = train['sentences'].apply(
    lambda x: [dict_emb[item][0] if item in dict_emb 
               else np.zeros(4096) 
               for item in x])
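
To confirm that the lookups really miss (instead of silently falling back to zeros), a count like the following might help. It is only a sketch: count_missing is a hypothetical helper, and the toy rows and emb stand in for train['sentences'] and dict_emb.

```python
from collections import Counter

def count_missing(rows, emb):
    """Count how often each sentence is absent from the embedding dict.

    rows: any iterable of lists of sentences (e.g. train['sentences']).
    emb:  dict mapping text -> embedding (e.g. dict_emb).
    """
    missing = Counter()
    for sentences in rows:
        for item in sentences:
            if item not in emb:
                missing[item] += 1
    return missing

# Toy data standing in for the real dataframe column and dictionary.
rows = [["Beyonce is a singer.", "Who is Beyonce?"], ["Beyonce is a singer."]]
emb = {"Who is Beyonce?": [[0.1, 0.2]]}

missing = count_missing(rows, emb)
print(len(missing), "distinct sentences have no embedding")
print(missing.most_common(3))
```

If nearly every sentence shows up in the counter, the dictionary was most likely never filled with sentence keys in the first place.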

Here is an excerpt of dict_emb:

{'What event was Frédéric a part of when he arrived in Paris during the later part of September in 1831?': array([[0.00812027, 0.0661487 , 0.05848939, ..., 0.02172186, 0.085614  ,
        0.04505331]], dtype=float32), 'To whom did Beyonce credit as her major influence on her music?': array([[ 0.01196026,  0.07206462,  0.0604387 , ..., -0.00673536,
         0.08809125,  0.04786895]], dtype=float32), 'Who was the first female to achieve the International Artist Award at the American Music Awards?': array([[0.00737114, 0.05858064, 0.04078764, ..., 0.02477051, 0.06046902,
        0.06636532]], dtype=float32),...

As you can see, it seems to contain the questions, but nothing like the first item being tested:

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.

Since dict_emb seems to work for the question embeddings, I suspect that the embedding dictionary itself is at fault.

All of this is an attempt to reproduce this girl's question answering tutorial.

The suspect part of the code

I wonder whether this part of the code in create_emb.ipynb is at fault:

import time
dict_embeddings = {}
t0 = time.time()
for i in range(len(questions)):
    if i%1000 == 0:
        t1 = time.time()
        total = t1-t0
        print("encoding number ",i," time since beginning:", total)
    dict_embeddings[questions[i]] = model.encode([questions[i]], tokenize=True)

It seems that only the questions get encoded.

How can I build a sentence embedding dictionary for a QA system?
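
A minimal sketch of what I have in mind, assuming the context sentences are available as lists of strings, split exactly the same way as in train['sentences']. The names build_embeddings, fake_encode and contexts are hypothetical. fake_encode is only a placeholder so the snippet runs on its own. With the real model, the encode argument could presumably be lambda batch: model.encode(batch, tokenize=True), as in the loop above.

```python
def build_embeddings(texts, encode):
    """Map each distinct text to its embedding.

    encode: callable taking a list of strings and returning one vector
            per string (assumption: the real model's encode fits here).
    """
    embeddings = {}
    for text in texts:
        if text not in embeddings:  # encode each distinct text only once
            embeddings[text] = encode([text])
    return embeddings

def fake_encode(batch):
    # Placeholder encoder, NOT a real model: one 2-dim vector per string.
    return [[float(len(s)), 0.0] for s in batch]

questions = ["Who is Beyonce?"]
contexts = [["Beyonce is a singer.", "She was born in 1981."]]

# Flatten the per-context sentence lists into one list of sentences.
sentences = [s for ctx in contexts for s in ctx]

# Encode questions AND sentences, so both kinds of keys exist.
dict_embeddings = build_embeddings(questions + sentences, fake_encode)
print(len(dict_embeddings), "texts encoded")
```

Each value keeps the one-row shape that the dict_emb[item][0] lookup expects. The keys must match the strings in train['sentences'] character for character, otherwise the lookup still falls back to zeros.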

0 answers: