Knowledge-based Q-A system does not give the most appropriate answers

Time: 2012-03-23 13:35:24

Tags: nlp question-answering

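I am working on a project that is essentially a knowledge-based question answering system. My system takes a query from the user, downloads the relevant documents from Wikipedia, strips all the HTML tags, and extracts the plain text. It then tokenizes the documents into sentences and builds a term-document (TD) matrix, with the query also passed in as a sentence. The TD matrix is then passed to the pLSA (probabilistic latent semantic analysis) algorithm. Finally, the cosine similarity between each document (sentence) vector and the query vector is computed, and the sentences most similar to the query vector are displayed as answers. (Stemming is also performed while the TD matrix is built.)

The problem is that results are displayed, but they are not the most relevant ones. Where am I going wrong? Is the strategy I am following correct, or are there other algorithms that might help?

Below are some questions and the answers my system returned: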

What is photosynthesis?
ANSWER  1 :   The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis 

ANSWER  2 :   Factors leaf is the primary site of photosynthesis in plants 

ANSWER  3 :   Samuel Ruben and Martin Kamen used radioactive isotopes to determine that the oxygen liberated in photosynthesis came from the water 

ANSWER  4 :   In plants, algae and cyanobacteria, photosynthesis releases oxygen 

Another question

What is Artificial Intelligence?
ANSWER  1 :   the problem of creating 'artificial intelligence' will substantially be solved" 

ANSWER  2 :   37 The leading-edge definition of artificial intelligence research is changing over time 

ANSWER  3 :   Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence 

ANSWER  4 :   History of artificial intelligence and Timeline of artificial intelligence Thinking machines and artificial beings appear in Greek myths , such as Talos of Crete , the bronze robot of Hephaestus , and Pygmalion's Galatea 13 Human likenesses believed to have intelligence were built in every major civilization 

Another question

Who is a hacker?

ANSWER  1 :   19 Hackers (short stories) Helba from the  

ANSWER  2 :   16 Rafael Núñez aka RaFa was a notorious most wanted hacker by the FBI since 2001 

ANSWER  3 :   Often, this type of 'white hat' hacker is called an ethical hacker 
ANSWER  4 :   Hackers also commonly use port scanners  

Yet another run

What is biology?
ANSWER  1 :   Molecular biology is the study of biology at a molecular level 

ANSWER  2 :   molecular biology studies the complex interactions of systems of biological molecules 

ANSWER  3 :   The similarities and differences between cell types are particularly relevant to molecular biology 

ANSWER  4 :   Contents History Foundations of modern biology 2 
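
For reference, a minimal sketch of the final ranking step described above, using only the Python standard library. It is not the actual system: the pLSA stage and stemming are left out, raw term counts stand in for the TD matrix rows, and the names `tokenize`, `cosine`, and `rank_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(query, sentences, top_k=4):
    """Return the top_k (score, sentence) pairs, most similar to the query first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(s))), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Scoring of this kind rewards sentences that share terms with the query, whether or not they read as a definition.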

2 answers:

Answer 0 (score: 2)

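This is a well-studied problem known as question answering (QA). I gave a summary of QA in another answer. In particular, according to TREC, all of your examples fall into the category of "definition questions". I suggest reading through some of the papers returned by a Google or Google Scholar query for "TREC definition questions".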

Answer 1 (score: 1)

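I think it will be hard to improve your system if you stay with a purely statistical approach. From a statistical NLP point of view, you are really doing the right thing. What you can do now is fine-tune some parameters. To do that, you have to build a training corpus by telling the system which answer is the right one, and then see which values the parameters must take to give you that answer.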

That being said, I do not think fine-tuning parameters will improve your accuracy by 20%~30%.
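
The tuning loop I have in mind could be sketched roughly as follows. This is only an illustration under assumptions: `answer_fn` is a hypothetical wrapper around your system that takes a question and one parameter value (for example the number of pLSA topics) and returns the top answer, and `labeled` is your hand-built list of (question, correct answer) pairs.

```python
def tune(answer_fn, labeled, candidates):
    """Pick the candidate parameter value with the best top-1 accuracy.

    answer_fn  -- hypothetical callable: (question, value) -> top answer
    labeled    -- list of (question, correct_answer) pairs built by hand
    candidates -- parameter values to try, in order
    """
    best_value, best_accuracy = None, -1.0
    for value in candidates:
        correct = sum(1 for question, gold in labeled
                      if answer_fn(question, value) == gold)
        accuracy = correct / len(labeled)
        # Strict comparison: on a tie, the earlier candidate is kept.
        if accuracy > best_accuracy:
            best_value, best_accuracy = value, accuracy
    return best_value, best_accuracy
```

In practice you would score on held-out questions rather than the ones you tuned on, otherwise the accuracy you measure is optimistic.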

If you want to go further, you need a more semantic approach that represents knowledge symbolically. See, for instance, http://www.jfsowa.com/