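I am trying to build a sentence autocompletion model that suggests similar sentences.
Problem: I have a corpus of more than 20,000 sentences. I want to write a program that suggests similar sentences to the user as they type.
For example: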
user: wh
suggestions: [{'what is your name?'},{'what is your profession?'},{'what do you want?'}, {'where are you?'}]
user: what is your
suggestions: [{'what is your name?'},{'what is your profession?'}]
Note:
My approach: so far, the only solution I have come up with is to store every sentence of the corpus in a trie.
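To make the trie idea concrete, here is a minimal sketch, assuming character-level nodes and case-insensitive matching. The class name SentenceTrie and the methods insert and suggest are hypothetical names chosen for illustration, and a real corpus would probably also need ranking (for example by frequency) rather than plain alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration; not taken from any library.
class SentenceTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfSentence;
    }

    private final Node root = new Node();

    /** Stores one sentence, lower-cased so that lookups ignore case. */
    void insert(String sentence) {
        Node node = root;
        for (char c : sentence.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfSentence = true;
    }

    /** Returns up to limit stored sentences that start with the given prefix. */
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        String p = prefix.toLowerCase(Locale.ROOT);
        Node node = root;
        for (char c : p.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored sentence has this prefix
            }
        }
        collect(node, new StringBuilder(p), out, limit);
        return out;
    }

    // Depth-first walk below the prefix node, collecting complete sentences.
    private void collect(Node node, StringBuilder path, List<String> out, int limit) {
        if (out.size() >= limit) {
            return;
        }
        if (node.endOfSentence) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out, limit);
            path.setLength(path.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        SentenceTrie trie = new SentenceTrie();
        trie.insert("what is your name?");
        trie.insert("what is your profession?");
        trie.insert("what do you want?");
        trie.insert("where are you?");
        // prints [what do you want?, what is your name?, what is your profession?, where are you?]
        System.out.println(trie.suggest("wh", 10));
        // prints [what is your name?, what is your profession?]
        System.out.println(trie.suggest("what is your", 10));
    }
}
```

Lookup cost depends on the prefix length plus the size of the subtree below it, not on the total corpus size, which is why a trie suits as-you-type completion. It only matches exact prefixes, so it will not find sentences that are merely similar.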
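I would like to know whether there is a machine learning technique that can be used for sentence suggestion while also taking the sentence prefix into account. I would really appreciate it if anyone could point me in the right direction.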
Answer 0 (score: 0)
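Text generation is a common application of RNNs. Given a sentence prefix, a neural network can be trained to predict the most likely next word. A very interesting article on this by Andrej Karpathy can be found here, along with the corresponding GitHub repo.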
Another popular approach is to generate text with Markov chains (see here, for example).
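As a rough illustration of the Markov chain idea, the sketch below builds a first-order, word-level model from the corpus and uses it to rank candidate next words for a typed prefix, rather than to generate free text. The class name NextWordModel and its methods are hypothetical, and splitting on whitespace is a simplifying assumption, so punctuation stays attached to words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical first-order (bigram) Markov model for next-word suggestion.
class NextWordModel {
    // counts.get(w1).get(w2) = how often w2 directly followed w1 in the corpus
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Adds the word transitions of one sentence to the model. */
    void train(String sentence) {
        String[] words = tokenize(sentence);
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Returns up to limit words that most often followed the last word of the prefix. */
    List<String> nextWords(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        String[] words = tokenize(prefix);
        if (words.length == 0) {
            return out;
        }
        Map<String, Integer> followers =
                counts.getOrDefault(words[words.length - 1], Collections.emptyMap());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(followers.entrySet());
        // Most frequent first; ties are broken alphabetically for stable output.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        for (Map.Entry<String, Integer> e : entries) {
            if (out.size() >= limit) {
                break;
            }
            out.add(e.getKey());
        }
        return out;
    }

    // Simplifying assumption: lower-case and split on whitespace only.
    private static String[] tokenize(String text) {
        String t = text.toLowerCase(Locale.ROOT).trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    public static void main(String[] args) {
        NextWordModel model = new NextWordModel();
        model.train("what is your name?");
        model.train("what is your profession?");
        model.train("what do you want?");
        model.train("where are you?");
        System.out.println(model.nextWords("what", 3));         // prints [is, do]
        System.out.println(model.nextWords("what is your", 3)); // prints [name?, profession?]
    }
}
```

Because only the last word of the prefix is considered, the suggestions ignore earlier context. A higher-order chain or the RNN approach mentioned above would condition on more of the prefix, at the cost of needing more data.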
Answer 1 (score: 0)
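import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Finds documents similar to a query string with Lucene's MoreLikeThis (Lucene 4.2 API).
public class Main {

    private Directory indexDir;
    private StandardAnalyzer analyzer;
    private IndexWriterConfig config;

    public static void main(String[] args) throws IOException {
        Main m = new Main();
        m.init();
        m.writerEntries();
        m.findSimilar("doduck prototype");
    }

    // Sets up the analyzer and an in-memory index.
    public void init() throws IOException {
        analyzer = new StandardAnalyzer(Version.LUCENE_42);
        config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        indexDir = new RAMDirectory(); // do not write to disk
    }

    // Indexes four sample documents.
    public void writerEntries() throws IOException {
        IndexWriter indexWriter = new IndexWriter(indexDir, config);
        indexWriter.commit();
        Document doc1 = createDocument("1", "doduck", "prototype your idea");
        Document doc2 = createDocument("2", "doduck", "love programming");
        Document doc3 = createDocument("3", "We do", "prototype");
        Document doc4 = createDocument("4", "We love", "challenge");
        indexWriter.addDocument(doc1);
        indexWriter.addDocument(doc2);
        indexWriter.addDocument(doc3);
        indexWriter.addDocument(doc4);
        indexWriter.commit();
        indexWriter.forceMerge(100, true);
        indexWriter.close();
    }

    private Document createDocument(String id, String title, String content) {
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        type.setStoreTermVectors(true); // term vectors are needed for MoreLikeThis
        Document doc = new Document();
        doc.add(new StringField("id", id, Store.YES));
        doc.add(new Field("title", title, type));
        doc.add(new Field("content", content, type));
        return doc;
    }

    // Builds a MoreLikeThis query from the input text and prints the top 10 matches.
    private void findSimilar(String searchForSimilar) throws IOException {
        IndexReader reader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        MoreLikeThis mlt = new MoreLikeThis(reader);
        // Thresholds of 0 so that no term is dropped as too rare in this tiny index.
        mlt.setMinTermFreq(0);
        mlt.setMinDocFreq(0);
        mlt.setFieldNames(new String[]{"title", "content"});
        mlt.setAnalyzer(analyzer);
        Reader sReader = new StringReader(searchForSimilar);
        Query query = mlt.like(sReader, null);
        TopDocs topDocs = indexSearcher.search(query, 10);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document aSimilar = indexSearcher.doc(scoreDoc.doc);
            String similarTitle = aSimilar.get("title");
            String similarContent = aSimilar.get("content");
            System.out.println("====similar found====");
            System.out.println("title: " + similarTitle);
            System.out.println("content: " + similarContent);
        }
        reader.close();
    }
}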