We have a question-answer corpus like the one shown below:
Q: Why did Lincoln issue the Emancipation Proclamation?
A: The goal was to weaken the rebellion, which was led and controlled by slave owners.
Q: Who is most noted for his contributions to the theory of molarity and molecular weight?
A: Amedeo Avogadro
Q: When did he drop John from his name?
A: upon graduating from college
Q: What do beetles eat?
A: Some are generalists, eating both plants and animals. Other beetles are highly specialised in their diet.
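Treat the questions as queries and the answers as documents.
We have to build a system that, for a given query (semantically similar to one of the questions in the question corpus), retrieves the correct document (an answer from the answer corpus).
Can anyone suggest an algorithm or a good approach for building this?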
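Answer 0 (score: 3)
Your question is too broad, and the task you are trying to accomplish is challenging. Still, I suggest reading IR-based Factoid Question Answering. That document references many state-of-the-art techniques, and reading it should give you some ideas.
Note that IR-based factoid QA and knowledge-based QA call for different approaches. First, decide which type of QA system you want to build.
Finally, I don't think simple document matching techniques are enough for QA. But you can try the simple Lucene-based approach suggested by @Debasis and see whether it performs well.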
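Answer 1 (score: 0)
In Lucene, treat a question and its answer (assuming there is only one) as a single document. Lucene supports fields within a document, so when building the document, make the question a searchable field. After retrieving the top-ranked questions for a given query question, use the get method of the Document class to return the answer.
Code skeleton (fill in the rest yourself):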
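// Imports (Lucene 5+ package layout assumed)
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Index
IndexWriterConfig iwcfg = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(...);
...
// One Lucene document per question-answer pair.
Document doc = new Document();
// TextField is analyzed and, with Store.YES, also stored; it replaces the
// deprecated Field.Index.ANALYZED constructor.
doc.add(new TextField("FIELD_QUESTION", questionBody, Field.Store.YES));
doc.add(new TextField("FIELD_ANSWER", answerBody, Field.Store.YES));
writer.addDocument(doc);
...

// Search
// IndexReader is abstract, so open a concrete reader via DirectoryReader.
IndexReader reader = DirectoryReader.open(..);
IndexSearcher searcher = new IndexSearcher(reader);
...
QueryParser parser = new QueryParser("FIELD_QUESTION", new StandardAnalyzer());
// Escape '?' and other query-syntax characters so the question is treated as plain text.
Query q = parser.parse(QueryParser.escape(queryQuestion));
...
TopDocs topDocs = searcher.search(q, 10); // top-10 retrieved
// Accumulate the answers from the retrieved questions which
// are similar to the query (new) question.
StringBuilder buff = new StringBuilder();
for (ScoreDoc sd : topDocs.scoreDocs) {
    Document retrievedDoc = reader.document(sd.doc);
    buff.append(retrievedDoc.get("FIELD_ANSWER")).append("\n");
}
System.out.println("Generated answer: " + buff.toString());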
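
If you want to try the "search the question field, return the stored answer" idea before wiring up Lucene, here is a rough, dependency-free sketch. It is not Lucene and does not reproduce Lucene's scoring; it swaps in a plain TF-IDF cosine similarity over the question text. The class name QaRetrievalSketch, its methods, and the crude tokenizer are all made up for illustration, and the sample corpus is the four pairs from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index: questions are matched, answers are only stored.
class QaRetrievalSketch {
    private final List<String> answers = new ArrayList<>();
    private final List<Map<String, Integer>> questionTermCounts = new ArrayList<>();
    private final Map<String, Integer> docFreq = new HashMap<>();

    // Crude analyzer substitute (assumption): lowercase, split on non-alphanumerics.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Index one question-answer pair; only the question contributes terms.
    void add(String question, String answer) {
        Map<String, Integer> counts = countTerms(question);
        for (String term : counts.keySet()) docFreq.merge(term, 1, Integer::sum);
        questionTermCounts.add(counts);
        answers.add(answer);
    }

    // Smoothed inverse document frequency; unseen terms get the largest weight.
    private double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return Math.log((1.0 + answers.size()) / (1.0 + df)) + 1.0;
    }

    private Map<String, Double> weigh(Map<String, Integer> counts) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(), e.getValue() * idf(e.getKey()));
        }
        return weights;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the answer stored with the most similar question, or null if nothing overlaps.
    String bestAnswer(String query) {
        Map<String, Double> queryWeights = weigh(countTerms(query));
        int best = -1;
        double bestScore = 0.0;
        for (int i = 0; i < answers.size(); i++) {
            double score = cosine(queryWeights, weigh(questionTermCounts.get(i)));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best < 0 ? null : answers.get(best);
    }

    // The four pairs from the question, used as a toy corpus.
    static QaRetrievalSketch sampleCorpus() {
        QaRetrievalSketch index = new QaRetrievalSketch();
        index.add("Why did Lincoln issue the Emancipation Proclamation?",
                "The goal was to weaken the rebellion, which was led and controlled by slave owners.");
        index.add("Who is most noted for his contributions to the theory of molarity and molecular weight?",
                "Amedeo Avogadro");
        index.add("When did he drop John from his name?",
                "upon graduating from college");
        index.add("What do beetles eat?",
                "Some are generalists, eating both plants and animals. Other beetles are highly specialised in their diet.");
        return index;
    }

    public static void main(String[] args) {
        QaRetrievalSketch index = sampleCorpus();
        System.out.println(index.bestAnswer("What do beetles usually eat?"));
    }
}
```

Like the Lucene skeleton, this only matches on shared words, so a paraphrase that shares no terms with any stored question returns nothing. That is the limitation the other answer points out about simple document matching.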