Indexing all files inside a folder in Solr

Time: 2016-07-05 13:09:22

Tags: indexing solr lucene directory

I am having trouble indexing a folder.

In example-data-config.xml:

    <dataConfig>
        <dataSource type="BinFileDataSource" />
        <document>
            <entity name="files"
                    dataSource="null"
                    rootEntity="false"
                    processor="FileListEntityProcessor"
                    baseDir="C:\Temp\" fileName=".*"
                    recursive="true"
                    onError="skip">
                <field column="fileAbsolutePath" name="id" />
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastModified" />

                <entity name="documentImport"
                        processor="TikaEntityProcessor"
                        url="${files.fileAbsolutePath}"
                        format="text">
                    <field column="file" name="fileName"/>
                    <field column="Author" name="author" meta="true"/>
                    <field column="text" name="text"/>
                </entity>
            </entity>
        </document>
    </dataConfig>

Then I created the schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
    <field name="fileName" type="string" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true" />
    <field name="size" type="plong" indexed="true" stored="true" />
    <field name="lastModified" type="pdate" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

Finally, I modified the solrconfig.xml file, adding the request handler along with the DataImportHandler and DataImportHandler-extras jars:

    <requestHandler name="/dataimport" class="solr.DataImportHandler">
        <lst name="defaults">
            <str name="config">example-data-config.xml</str>
        </lst>
    </requestHandler>

I ran the import, and the result was:

(image: result)

The folder contains 20,000 files in different formats (.py, .java, .wsdl, etc.).

Any suggestions would be appreciated. Thanks :))

2 Answers:

Answer 0: (score: 0)

Check your Solr logs. The reason DataImportHandler is failing will certainly be there. I ran into the same situation and found from the Solr logs that my DataImportHandler was throwing exceptions because of encrypted documents in the folder. Your cause may be different, but start by analyzing the Solr logs: run the entity again from the DataImport section, then check the Logging section of the admin page for the errors that appear. If you get errors other than the one I mentioned, post them here so they can be understood and deciphered.

Answer 1: (score: 0)

ERROR (Thread-17) [   x:example] o.a.s.h.d.DocBuilder Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 157
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
 Caused by: org.apache.tika.exception.TikaException: image/png parse error
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:159)
... 9 more
 Caused by: javax.imageio.IIOException: I/O error reading PNG header!
at com.sun.imageio.plugins.png.PNGImageReader.readHeader(PNGImageReader.java:315)
at com.sun.imageio.plugins.png.PNGImageReader.getWidth(PNGImageReader.java:1361)
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92)
... 13 more
Caused by: javax.imageio.IIOException: Image width == 0!
at com.sun.imageio.plugins.png.PNGImageReader.readHeader(PNGImageReader.java:273)
... 15 more
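
Reading the trace from the bottom up: the root cause is a PNG file whose header cannot be read (`Image width == 0!`). Tika's `ImageParser` fails on it, and `TikaEntityProcessor.nextRow` rethrows the failure as "Unable to read content". In the data config from the question, `onError="skip"` is set only on the outer `files` entity, not on the inner Tika entity where the exception is raised. The fragment below is a minimal sketch of one possible adjustment, not a confirmed fix. It assumes the Solr version in use honors `onError` on `TikaEntityProcessor`, and the `excludes` pattern (a regex on file names) lists image extensions purely as an illustration.

```xml
<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <!-- Assumption: image files are not wanted in the index, so the
             excludes regex filters them out before Tika ever sees them. -->
        <entity name="files"
                dataSource="null"
                rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="C:\Temp\" fileName=".*"
                excludes=".*\.(png|jpg|jpeg|gif)$"
                recursive="true"
                onError="skip">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <!-- onError="skip" is added here as well, so a file that Tika
                 cannot parse is skipped instead of failing the document. -->
            <entity name="documentImport"
                    processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}"
                    format="text"
                    onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

After changing the config, reload it and rerun the full import, then check the logs again: files that are skipped should no longer abort processing, and any remaining exceptions point at the next file type to handle.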