Indexing an HTML file by identifier with Lucene

Date: 2018-05-16 20:46:59

Tags: java lucene

First of all, I apologize for my poor English. I have an HTML file named cacm.html that contains a large number of documents, each structured like this:

.I indicates article identifier
.T indicates article title
.A indicates article authors
.W indicates article abstract
.X indicates article references

Here is an example of one article:

.I 20
.T
Accelerating Convergence of Iterative Processes
.W
A technique is discussed which, when applied
to an iterative procedure for the solution of
an equation, accelerates the rate of convergence if
the iteration converges and induces convergence if
the iteration diverges.  An illustrative example is given.
.B
CACM June, 1958
.A
Wegstein, J. H.
.N
CA580602 JB March 22, 1978  9:09 PM
.X
20  5   20
20  5   20
20  5   20

I wrote this code:

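    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class in {
        public static void main(String[] args) throws IOException {
            // Index directory; any command-line arguments are appended as sub-paths.
            Path p = Paths.get("C:\\Users\\pc\\Desktop\\indexationeclipc", args);
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory directory = FSDirectory.open(p);
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            IndexWriter iwriter = new IndexWriter(directory, config);

            BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\pc\\Desktop\\index\\cacm.htm"));

            // true while the lines of the current section (.T, .A, .W, .X) are collected
            boolean lire = false;

            // fields[0] = .I, fields[1] = .T, fields[2] = .A, fields[3] = .W, fields[4] = .X
            String[] fields = new String[5];
            for (int i = 0; i < fields.length; i++) {
                fields[i] = "";
            }
            int fieldno = 0;

            String line = br.readLine();
            while (line != null) {
                if (line.startsWith(".I")) {
                    String[] parts = line.split(" ");
                    fields[0] = parts[1];
                    fieldno = 0;

                    // Here fields[0] is the id that was just read, while fields[1..4]
                    // hold the text collected since the previous .I line.
                    if (!fields[0].equals("")) {
                        Document doc = new Document();

                        // StringField indexes the whole value as a single token.
                        Field I = new StringField("I", fields[0], Field.Store.YES);
                        doc.add(I);

                        Field T = new StringField("T", fields[1], Field.Store.YES);
                        doc.add(T);

                        Field A = new StringField("A", fields[2], Field.Store.YES);
                        doc.add(A);

                        // TextField passes the value through the analyzer.
                        Field W = new TextField("W", fields[3], Field.Store.YES);
                        doc.add(W);

                        Field X = new TextField("X", fields[4], Field.Store.YES);
                        doc.add(X);

                        iwriter.addDocument(doc);
                    }

                    // Reset every field, including the id, for the next article.
                    for (int i = 0; i < fields.length; i++) {
                        fields[i] = "";
                    }
                } else if (line.startsWith(".T")) {
                    lire = true;
                    fieldno = 1;
                } else if (line.startsWith(".A")) {
                    lire = true;
                    fieldno = 2;
                } else if (line.startsWith(".W")) {
                    lire = true;
                    fieldno = 3;
                } else if (line.startsWith(".X")) {
                    lire = true;
                    fieldno = 4;
                } else if (line.startsWith(".")) {
                    // Any other marker (.B, .N, ...) stops collection.
                    lire = false;
                }

                // Append the line to the current field; lines of one or two
                // characters (including the bare markers) are skipped.
                if ((fieldno > 0) && (fieldno < 5)) {
                    if (lire) {
                        if (line.length() > 2) {
                            fields[fieldno] += " " + line;
                        }
                    }
                }

                line = br.readLine();
            }

            br.close();
            iwriter.close();
        }
    }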

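But the indexing does not complete: it stops before all the terms have been indexed, the index seems to contain the same word about a thousand times, and sometimes whole phrases are indexed instead of individual terms :(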

1 answer:

Answer 0 (score: 0)

This is unlikely to be an indexing or Lucene problem. I think the problem lies in how you read this data and split it apart.
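One way to check the reading and splitting step on its own is to separate it from Lucene entirely. The following is a minimal sketch that uses only the standard library, under these assumptions: every article starts with a `.I <id>` line, a section marker is a dot followed by one uppercase letter alone on its line, only the `.T`, `.A`, `.W` and `.X` sections are kept, and any HTML wrapper around the records has already been removed. The class name `CacmParser` and its `parse` method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CacmParser {
    // Sections whose text is kept; other sections (.B, .N, ...) are skipped.
    private static final List<String> KEPT = Arrays.asList("T", "A", "W", "X");

    /** Splits the lines into one map per article, keyed by section letter. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null; // article being built; null before the first .I
        String tag = null;                  // section receiving text; null while skipping
        for (String line : lines) {
            if (line.startsWith(".I ")) {
                // A new id closes the previous article, with its own id and text together.
                if (current != null) {
                    records.add(current);
                }
                current = new LinkedHashMap<>();
                current.put("I", line.substring(3).trim());
                tag = null;
            } else if (isMarker(line)) {
                String name = line.substring(1, 2);
                tag = KEPT.contains(name) ? name : null;
            } else if (current != null && tag != null && !line.trim().isEmpty()) {
                String old = current.get(tag);
                current.put(tag, old == null ? line.trim() : old + " " + line.trim());
            }
        }
        // The last article has no following .I line, so it is added here.
        if (current != null) {
            records.add(current);
        }
        return records;
    }

    // A marker line is a dot, one uppercase letter, and optional trailing whitespace.
    private static boolean isMarker(String line) {
        return line.matches("\\.[A-Z]\\s*");
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                ".I 20", ".T", "Accelerating Convergence of Iterative Processes",
                ".B", "CACM June, 1958", ".A", "Wegstein, J. H.");
        for (Map<String, String> record : parse(sample)) {
            System.out.println(record);
        }
    }
}
```

Each returned map can then be turned into one Lucene `Document`. Since `StringField` indexes its value as a single token, using `TextField` for the title and authors would let the analyzer split them into terms.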

I suggest using Apache Tika for the HTML data extraction. It is very useful. See https://tika.apache.org/
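As a rough sketch of what that could look like, the snippet below uses the `Tika` facade class to pull the plain text out of the file. It assumes Tika and its HTML parser are on the classpath; the class name `ExtractCacmText` and the default file name are placeholders, not part of the original question. It is worth checking that the `.I`, `.T` and other markers still start their own lines in the extracted text before parsing it further.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

final class ExtractCacmText {
    public static void main(String[] args) throws IOException, TikaException {
        // Placeholder path; pass the real location as the first argument.
        File input = new File(args.length > 0 ? args[0] : "cacm.html");

        Tika tika = new Tika();
        // parseToString truncates long output by default; -1 removes the limit.
        tika.setMaxStringLength(-1);

        // Detects the file type and returns its text content without markup.
        String text = tika.parseToString(input);

        String[] lines = text.split("\\R");
        System.out.println(lines.length + " lines extracted");
    }
}
```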