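How to index different types of files (pdf, word, html, etc.) with a SolrJ Java application

Time: 2019-01-28 09:56:15

Tags: java solrj

I am new to SolrJ. I need to index zip, pdf, and html documents using the SolrJ Java API. Can anyone give me some examples of indexing different types of documents with SolrJ in a Java application?

Can anyone point me to good links with Java examples for indexing the different types of documents in a folder?

Thanks for your help.

From the output, it is clear that SolrJ is not indexing the .xml file I am trying. Can anyone point out what I am doing wrong?

Code: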

    String urlString = "http://localhost:8983/solr/tests";
    HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();

    solr.setParser(new XMLResponseParser());

    File file = new File("D:/work/devtools/Solr/solr-7.6.0/example/exampledocs/hd.xml");
    InputStream fis = new FileInputStream(file);
    /* Tika specific */
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, "hd.xml");
    ParseContext parseContext = new ParseContext();
    // Automatically detect the best parser based on the detected document type
    AutoDetectParser autodetectParser = new AutoDetectParser();
    // OOXMLParser parser = new OOXMLParser();
    autodetectParser.parse(fis, contenthandler, metadata, parseContext);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getCanonicalPath());
    SolrQuery query = new SolrQuery("*.*");
    // query.set("q", "price:599.99");
    QueryResponse response = solr.query(query);

Output:

solr query{responseHeader={status=0,QTime=0,params={q=*.*,wt=xml,version=2.2}},response={numFound=0,start=0,docs=[]}}

2 Answers:

Answer 0: (score: 0)

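Links with basic information:

1) https://www.youtube.com/watch?v=rxoS1p1TaFY&t=198s
2) https://lucene.apache.org/solr/ (download the latest version from here)

How to use SolrJ in a Java application:

The Java version should be 1.8. Download the latest Solr release and unzip it.

1) Add the dependency to your pom.xml file:

    <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr-solrj</artifactId>
        <version>7.6.0</version>
    </dependency>

Start Solr from the solr/bin folder and check the Solr admin console at http://localhost:8983/solr/#

2) Basic example code (this code is enough to understand SolrJ). Create the indexfiles core in Solr and use the following code: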

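    import java.io.File;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;

    String urlString = "http://localhost:8983/solr/indexfiles";
    HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();
    solr.setParser(new XMLResponseParser());

    File file = new File("D:/work/devtools/Solr/solr-7.6.0/example/exampledocs/176444.zip");

    // Send the file to the extracting handler, which runs Tika on the server side
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");

    // Change the content type for different input files, e.g. "application/pdf" for a PDF
    req.addFile(file, "application/zip");
    // Use the file name as the unique id of the indexed document
    req.setParam("literal.id", file.getName());
    // Commit as part of this request (waitFlush = true, waitSearcher = true)
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    NamedList<Object> result = solr.request(req);
    int status = (Integer) ((NamedList<?>) result.get("responseHeader")).get("status");

    System.out.println("Status: " + status);
    System.out.println("Result: " + result);
    // "*:*" matches all documents ("*.*" is not the match-all query)
    System.out.println("solr query" + solr.query(new SolrQuery("*:*")));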
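
The comment in the code above says to change the content type for different input files. Here is a minimal sketch of one way to choose that value, assuming a small hand-written extension table. The class name ContentTypeGuess, the method forFileName, and the fallback application/octet-stream are illustrative choices and not part of SolrJ:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ContentTypeGuess {
    // Hypothetical lookup table; extend it with the file types you actually index
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("zip", "application/zip");
        TYPES.put("html", "text/html");
        TYPES.put("htm", "text/html");
        TYPES.put("xml", "application/xml");
        TYPES.put("doc", "application/msword");
        TYPES.put("txt", "text/plain");
    }

    // Returns a MIME type for the file name, or a generic binary type if unknown
    public static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(forFileName("176444.zip"));
        System.out.println(forFileName("report.PDF"));
    }
}
```

The returned string could then be passed as the second argument of req.addFile(file, ...).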



3) Query from the Solr admin console, or directly with a URL such as http://localhost:8983/solr/indexfiles/select?q=SOLR1000

Just change the q parameter (q="<text to search>") to the text you want to find in the files you indexed.

You can also find the q parameter on the query screen of the Solr admin console, where you can enter the text to search for if you are not comfortable writing Solr queries. By default it is *:*
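
If you want to build that select URL from Java instead of typing it, a minimal sketch using only the JDK could look like this. The class name SelectUrl and the method build are hypothetical; the fallback to *:* for empty input mirrors the admin console default mentioned above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SelectUrl {
    // Builds <coreUrl>/select?q=<encoded text>; empty text falls back to the match-all query
    public static String build(String coreUrl, String queryText) {
        String q = (queryText == null || queryText.isEmpty()) ? "*:*" : queryText;
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/indexfiles";
        System.out.println(build(core, "SOLR1000"));
        System.out.println(build(core, ""));
    }
}
```

Encoding the text matters once the search text contains spaces or characters such as : and &.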


NOTE: You do not need to integrate Apache Tika with Apache Solr yourself to index zip files and the like, because it is available by default in recent Solr versions.

Note: Do not be confused by the difference between the output you see after indexing with the standalone tools (which shows the complete data, e.g. when hd.xml from Solr's /exampledocs folder is indexed) and the output you get after indexing the same file with SolrJ from a Java application.

For example, with SolrJ as used above, only the file itself is indexed, so when you run the following query from the Solr admin console you see output like this:
(http://localhost:8983/solr/indexfiles/select?q=*:*)
Output:

{
        "id":"hd.xml",
        "stream_size":["null"],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.xml.DcXMLParser"],
        "stream_content_type":["text/xml"],
        "content_type":["application/xml"],
        "_version_":1624155471570010112},


But if we index through the command prompt using ---> java -Dc=name -jar post.jar *.xml, the output of the same query contains the data available inside the xml file (http://localhost:8983/solr/indexfiles/select?q=*:*)
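
A likely explanation for this difference is which request handler receives the file. The SolrJ code in this answer sends everything to /update/extract, where Tika treats hd.xml as a generic XML document, while the post.jar command above posts .xml files to /update, where they are read as Solr's own <add><doc> format. The following is only a sketch of that routing decision, under the assumption that XML, JSON, and CSV files are already Solr update messages. The class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class HandlerRouter {
    // Assumption: files with these extensions are already in a format /update understands
    private static final Set<String> STRUCTURED =
            new HashSet<>(Arrays.asList("xml", "json", "csv"));

    // Picks the request handler path for a file based on its extension
    public static String handlerFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return STRUCTURED.contains(ext) ? "/update" : "/update/extract";
    }

    public static void main(String[] args) {
        System.out.println("hd.xml     -> " + handlerFor("hd.xml"));
        System.out.println("176444.zip -> " + handlerFor("176444.zip"));
    }
}
```

Note that an .xml file which is not in the <add><doc> format would be rejected by /update, so the extension check alone is not enough for arbitrary XML.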

Answer 1: (score: 0)

Here is code specifically for indexing an XML file into Solr. The XML must be in the following format, however.

<add>
<doc>
 <field name="id">PMID</field>
 <field name="year_i">Year</field>
 <field name="name">ArticleTitle</field>
 <field name="abstract_s">AbstractText</field>
 <field name="cat">MeshHeading1</field>
 <field name="cat">MeshHeading2</field>
</doc>
</add>
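
To make the format concrete, here is a minimal sketch that builds such an <add><doc> message from field name/value pairs using only the JDK. The class SolrXmlBuilder and its methods are hypothetical helpers, not part of SolrJ, and the escaping covers only the five XML special characters:

```java
import java.util.ArrayList;
import java.util.List;

public class SolrXmlBuilder {
    // Kept as a list (not a map) so a multi-valued field such as "cat" can repeat
    private final List<String[]> fields = new ArrayList<>();

    public SolrXmlBuilder field(String name, String value) {
        fields.add(new String[] {name, value});
        return this;
    }

    // Escapes the XML special characters; "&" must be replaced first
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Renders one document wrapped in <add><doc>...</doc></add>
    public String build() {
        StringBuilder sb = new StringBuilder("<add>\n<doc>\n");
        for (String[] f : fields) {
            sb.append(" <field name=\"").append(escape(f[0])).append("\">")
              .append(escape(f[1])).append("</field>\n");
        }
        return sb.append("</doc>\n</add>\n").toString();
    }

    public static void main(String[] args) {
        String xml = new SolrXmlBuilder()
                .field("id", "12345")
                .field("name", "Heart & lung study")
                .field("cat", "MeshHeading1")
                .field("cat", "MeshHeading2")
                .build();
        System.out.println(xml);
    }
}
```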

Below is the code to index the XML data into Solr.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;
    import org.apache.solr.client.solrj.request.DirectXmlRequest;

    // Read the whole Solr XML file into a string (assumes the file is UTF-8 encoded)
    String xml2String = new String(Files.readAllBytes(Paths.get("example.xml")), StandardCharsets.UTF_8);

    String urlString = String.format("http://localhost:8983/solr/%s", "pubmed1");
    HttpSolrClient server = new HttpSolrClient.Builder(urlString).build();
    server.setParser(new XMLResponseParser());

    // Post the raw <add><doc>...</doc></add> XML to the /update handler, then commit
    DirectXmlRequest xmlreq = new DirectXmlRequest("/update", xml2String);
    server.request(xmlreq);
    server.commit();
    server.close();

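As for Apache Tika, it helps you extract the content of a file. The file can be xlsx, pdf, html, or xml. If your input is an XML file in some other layout, you need to write a parser that converts it into the Solr XML format shown above; for XML input you can use XSLT for this.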
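
The XSLT route can be sketched with the JDK's built-in javax.xml.transform API. The input element names (article, pmid, title) and the stylesheet itself are invented for illustration; a real stylesheet would follow the layout of your own XML:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToSolrXml {
    // Hypothetical stylesheet: maps <article><pmid/><title/></article> to Solr XML
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/article\">"
        + "<add><doc>"
        + "<field name=\"id\"><xsl:value-of select=\"pmid\"/></field>"
        + "<field name=\"name\"><xsl:value-of select=\"title\"/></field>"
        + "</doc></add>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the input XML and returns the result as a string
    public static String transform(String inputXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(inputXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<article><pmid>12345</pmid><title>Sample title</title></article>";
        System.out.println(transform(input));
    }
}
```

The resulting string could then be passed to DirectXmlRequest in the same way as xml2String in the code above.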