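I am trying to build an embedding function for question answering. The function works well for the question embeddings, but it has a problem with the sentence embeddings. For each item in the dataframe, train['sent_emb'] should hold that item's embedding from dict_emb whenever the item is found there. However, none of the sentence items are ever found: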
train['sent_emb'] = train['sentences'].apply(
    lambda x: [dict_emb[item][0] if item in dict_emb
               else np.zeros(4096)
               for item in x])
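One way to confirm that the lookup never hits is to count how many sentences are actually keys of the dictionary. Below is a minimal sketch of that check. The helper name `lookup_coverage` and the toy data are hypothetical stand-ins, not part of the original code; with the real data you would pass `train['sentences']` and `dict_emb` instead.

```python
def lookup_coverage(sentence_lists, dict_emb):
    """Return (found, total): how many sentences are exact keys of dict_emb."""
    found = 0
    total = 0
    for sentences in sentence_lists:
        for item in sentences:
            total += 1
            if item in dict_emb:
                found += 1
    return found, total


# Toy stand-ins (assumptions for illustration only).
toy_dict_emb = {"Who wrote it?": [[0.1, 0.2]]}
toy_sentences = [
    ["Beyoncé is an American singer.", "She was born in 1981."],
    ["Who wrote it?"],
]

found, total = lookup_coverage(toy_sentences, toy_dict_emb)
print(f"{found}/{total} sentences have an embedding")
```

If the count is zero on the real data, every entry of train['sent_emb'] is falling back to np.zeros(4096), which points at the dictionary keys rather than at the lookup itself.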
Here is an excerpt of dict_emb:
{'What event was Frédéric a part of when he arrived in Paris during the later part of September in 1831?': array([[0.00812027, 0.0661487 , 0.05848939, ..., 0.02172186, 0.085614 ,
0.04505331]], dtype=float32), 'To whom did Beyonce credit as her major influence on her music?': array([[ 0.01196026, 0.07206462, 0.0604387 , ..., -0.00673536,
0.08809125, 0.04786895]], dtype=float32), 'Who was the first female to achieve the International Artist Award at the American Music Awards?': array([[0.00737114, 0.05858064, 0.04078764, ..., 0.02477051, 0.06046902,
0.06636532]], dtype=float32),...
As you can see, it seems to contain the questions, but nothing like the first item of the test:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.
Since dict_emb seems to work for embedding the questions, I suspect the embedding dictionary itself is at fault.
All of these attempts are meant to reproduce this girl's question answering tutorial.
I wonder whether this part of the code in create_emb.ipynb is the culprit:
import time
dict_embeddings = {}
t0 = time.time()
for i in range(len(questions)):
    if i%1000 == 0:
        t1 = time.time()
        total = t1-t0
        print("encoding number ",i," time since beginning:", total)
    dict_embeddings[questions[i]] = model.encode([questions[i]], tokenize=True)
It seems that only the questions are encoded.
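If that is indeed the cause, one possible fix would be to encode the context sentences with the same loop, so that they become keys of the dictionary as well. The following is only a sketch under that assumption: `build_embedding_dict` and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for the real encoder so the example runs on its own. With the actual model, the callable could be `lambda batch: model.encode(batch, tokenize=True)`, which is the call used above.

```python
def build_embedding_dict(texts, encode):
    """Map every unique text to its embedding, using the supplied encode callable."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate while keeping order
    return {text: encode([text]) for text in unique_texts}


def fake_encode(batch):
    # Hypothetical stand-in for the real encoder: one tiny "vector" per text.
    return [[float(len(text)), float(text.count(" "))] for text in batch]


# Toy stand-ins for the real question and sentence lists (assumptions).
questions = ["Who is Beyoncé?"]
sentences = ["Beyoncé is an American singer.", "She was born in 1981."]

# Encode questions AND sentences so both kinds of text can be looked up later.
dict_emb = build_embedding_dict(questions + sentences, fake_encode)
print(all(text in dict_emb for text in sentences))
```

Note that the keys must match the strings stored in train['sentences'] exactly, so the sentences should be split the same way in both places.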