Question

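I am trying to build a sentence autocomplete model that suggests similar sentences.

Problem: I have a corpus of more than 20,000 sentences. I want to write a program that suggests similar sentences to the user as he/she types.

For example: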

user: wh
suggestions: [{'what is your name?'},{'what is your profession?'},{'what do you want?'}, {'where are you?'}]

user: what is your
suggestions: [{'what is your name?'},{'what is your profession?'}]

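Notes:

Word order matters, i.e. the user's input must match the prefix of each suggested sentence.
Suggestions must come from the available text corpus.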

My approach: so far, the only solution I have come up with is to use a trie data structure to store every sentence in the text corpus.
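The trie idea described above could look roughly like the following. This is only a minimal sketch: the class name `SentenceTrie`, the `suggest` method and its `limit` parameter are made up for illustration, and lower-casing the input is an assumption about how matching should behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Character-level trie that returns stored sentences starting with a typed prefix. */
class SentenceTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out in alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfSentence;
    }

    private final Node root = new Node();

    /** Stores one sentence, lower-cased so that lookups are case-insensitive (an assumption). */
    void insert(String sentence) {
        Node node = root;
        for (char c : sentence.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfSentence = true;
    }

    /** Returns at most `limit` stored sentences that start with `prefix`. */
    List<String> suggest(String prefix, int limit) {
        List<String> results = new ArrayList<>();
        String normalized = prefix.toLowerCase();
        Node node = root;
        for (char c : normalized.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return results; // no sentence starts with this prefix
            }
        }
        collect(node, new StringBuilder(normalized), results, limit);
        return results;
    }

    /** Depth-first walk below `node`, appending every complete sentence found. */
    private void collect(Node node, StringBuilder path, List<String> out, int limit) {
        if (out.size() >= limit) {
            return;
        }
        if (node.endOfSentence) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            if (out.size() >= limit) {
                return;
            }
            path.append(e.getKey());
            collect(e.getValue(), path, out, limit);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        SentenceTrie trie = new SentenceTrie();
        trie.insert("what is your name?");
        trie.insert("what is your profession?");
        trie.insert("what do you want?");
        trie.insert("where are you?");
        System.out.println(trie.suggest("wh", 10));
        System.out.println(trie.suggest("what is your", 10));
    }
}
```

Because every suggestion is read back from the trie, both requirements from the notes hold by construction: the suggestion always starts with the typed text and always comes from the corpus. Ranking (for example by sentence frequency) is left out of this sketch.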

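I would like to know whether there is any machine learning technique that can be used for sentence suggestion while also taking the sentence prefix into account. I would really appreciate it if anyone could point me in the right direction.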

Answer 1

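Text generation is a common application of RNNs. Given a sentence prefix, a neural network can be trained to predict the most likely next word. Andrej Karpathy has written a very interesting article on this, with a corresponding GitHub repo.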

Another popular approach is to generate text with Markov chains.
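To make the Markov chain idea concrete, here is a minimal sketch of a first-order, word-level chain that predicts the next word from the previous one. The class name `BigramModel` and its methods are hypothetical, and tokenizing on whitespace is an assumption. Note that text generated this way is not guaranteed to be a sentence from the corpus.

```java
import java.util.HashMap;
import java.util.Map;

/** First-order word-level Markov chain: counts which word follows which. */
class BigramModel {
    // counts.get(w1).get(w2) = how often w2 directly followed w1 in the training data
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Adds the word-to-word transitions of one sentence to the model. */
    void train(String sentence) {
        String[] words = sentence.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Returns the most frequent successor of `word`, or null if the word was never seen.
     *  Ties are broken alphabetically so that the result is deterministic. */
    String predictNext(String word) {
        Map<String, Integer> followers = counts.get(word.toLowerCase());
        if (followers == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel();
        model.train("what is your name?");
        model.train("what is your profession?");
        model.train("what do you want?");
        System.out.println(model.predictNext("what"));
        System.out.println(model.predictNext("is"));
    }
}
```

A higher-order chain (conditioning on the last two or three words) would follow the typed prefix more closely, at the cost of sparser counts.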

Answer 2

If you want to use Lucene, its MoreLikeThis class can find similar sentences. Alternatively, you can apply cosine similarity. Hope this helps.

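import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// The fields and methods below are members of a class named Main (Lucene 4.2 API).

public static void main(String[] args) throws IOException {
    Main m = new Main();
    m.init();            // set up analyzer, writer config and in-memory index
    m.writerEntries();   // index a few sample documents
    m.findSilimar("doduck prototype"); // query for documents similar to this text
}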

private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;

public void init() throws IOException{
    analyzer = new StandardAnalyzer(Version.LUCENE_42);
    config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    config.setOpenMode(OpenMode.CREATE_OR_APPEND);

    indexDir = new RAMDirectory(); // keep the index in memory; nothing is written to disk

}

public void writerEntries() throws IOException{
    IndexWriter indexWriter = new IndexWriter(indexDir, config);
    indexWriter.commit();

    Document doc1 = createDocument("1","doduck","prototype your idea");
    Document doc2 = createDocument("2","doduck","love programming");
    Document doc3 = createDocument("3","We do", "prototype");
    Document doc4 = createDocument("4","We love", "challenge");
    indexWriter.addDocument(doc1);
    indexWriter.addDocument(doc2);
    indexWriter.addDocument(doc3);
    indexWriter.addDocument(doc4);

    indexWriter.commit();
    indexWriter.forceMerge(100, true);
    indexWriter.close();
}

private Document createDocument(String id, String title, String content) {
    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setStored(true);
    type.setStoreTermVectors(true); //TermVectors are needed for MoreLikeThis

    Document doc = new Document();
    doc.add(new StringField("id", id, Store.YES));
    doc.add(new Field("title", title, type));
    doc.add(new Field("content", content, type));
    return doc;
}


private void findSilimar(String searchForSimilar) throws IOException {
    IndexReader reader = DirectoryReader.open(indexDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setMinTermFreq(0);
    mlt.setMinDocFreq(0);
    mlt.setFieldNames(new String[]{"title", "content"});
    mlt.setAnalyzer(analyzer);


    Reader sReader = new StringReader(searchForSimilar);
    Query query = mlt.like(sReader, null);

    TopDocs topDocs = indexSearcher.search(query,10);

    for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
        Document aSimilar = indexSearcher.doc( scoreDoc.doc );
        String similarTitle = aSimilar.get("title");
        String similarContent = aSimilar.get("content");

        System.out.println("====similar found====");
        System.out.println("title: "+ similarTitle);
        System.out.println("content: "+ similarContent);
    }

    reader.close(); // release the index reader once the results have been printed
}
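
The cosine similarity alternative mentioned in this answer can be sketched without Lucene as follows. This is a minimal illustration using raw term counts; the class name `CosineSimilarity` and the tokenization on non-word characters are assumptions. Cosine similarity ignores word order, so on its own it does not enforce the prefix requirement from the question.

```java
import java.util.HashMap;
import java.util.Map;

/** Bag-of-words cosine similarity between two sentences, based on raw term counts. */
class CosineSimilarity {

    /** Lower-cases the text, splits it on non-word characters and counts each token. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Euclidean length of a term-count vector. */
    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int c : vector.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    /** Returns a value in [0, 1]; 0 if either text contains no tokens. */
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            Integer other = vb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double denominator = norm(va) * norm(vb);
        return denominator == 0 ? 0 : dot / denominator;
    }

    public static void main(String[] args) {
        System.out.println(cosine("what is your name?", "what is your profession?"));
        System.out.println(cosine("what is your name?", "where are you?"));
    }
}
```

To rank suggestions, one could compute this score between the user's input and each candidate sentence and sort by it; for a corpus of 20,000 sentences, an index such as the Lucene one above is likely to scale better than a linear scan.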

Suggesting similar sentences
