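I have a project that uses Lucene for search. The project has a source folder and a target folder in which the index is created. The source folder contains multiple subfolders, and each subfolder contains multiple HTML files. I am running a wildcard search over the content of those HTML pages. The search first identifies the file paths from the hits and then extracts the matching fragments from each page.
My problem is that the search correctly finds the paths of the files that contain a match, but when I fetch the result content for those files it comes back blank.
Below is the code I use to create the index and to run the search.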
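import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexCode
{
    public static void main(String[] args)
    {
        String docsPath = "Souce";
        String indexPath = "target";
        final Path docDir = Paths.get(docsPath);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        // try-with-resources closes the writer and directory even if indexing fails.
        try (Directory dir = FSDirectory.open(Paths.get(indexPath));
             IndexWriter writer = new IndexWriter(dir, iwc))
        {
            indexDocs(writer, docDir);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    // Indexes a single file, or every file under a directory tree.
    static void indexDocs(final IndexWriter writer, Path path) throws IOException
    {
        if (Files.isDirectory(path))
        {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>()
            {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
                {
                    try
                    {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    }
                    catch (IOException ioe)
                    {
                        // Report the unreadable file and keep walking.
                        ioe.printStackTrace();
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        else
        {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException
    {
        // The raw file (markup, CSS and JS included) is indexed and stored as-is.
        // Assumption: the files are UTF-8; the original relied on the platform default charset.
        String contents = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Document doc = new Document();
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", lastModified));
        doc.add(new TextField("contents", contents, Field.Store.YES));
        // Replaces any document previously indexed under the same path.
        writer.updateDocument(new Term("path", file.toString()), doc);
    }
}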
Now I want to run a wildcard search for "New York". Here is the search code:
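import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.surround.query.BasicQueryFactory;
import org.apache.lucene.queryparser.surround.query.SrndQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchCode
{
    private static final String TARGET = "target";

    public static void main(String[] args) throws Exception
    {
        // try-with-resources closes the reader as well as the directory.
        try (Directory dir = FSDirectory.open(Paths.get(TARGET));
             IndexReader reader = DirectoryReader.open(dir))
        {
            IndexSearcher searcher = new IndexSearcher(reader);
            Analyzer analyzer = new StandardAnalyzer();
            org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
            // W(...) is the surround parser's ordered proximity operator.
            SrndQuery srndquery = surroundparser.parse("W(new*, del*)");
            // The original snippet assigned to "query" without declaring it.
            Query query = srndquery.makeLuceneQueryField("contents", new BasicQueryFactory());
            TopDocs hits = searcher.search(query, 10, Sort.INDEXORDERED);
            Formatter formatter = new SimpleHTMLFormatter();
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 10);
            highlighter.setTextFragmenter(fragmenter);
            for (int i = 0; i < hits.scoreDocs.length; i++)
            {
                int docid = hits.scoreDocs[i].doc;
                Document doc = searcher.doc(docid);
                String title = doc.get("path");
                System.out.println("Path " + " : " + title);
                // Highlight against the stored raw page text.
                String text = doc.get("contents");
                TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "contents", analyzer);
                String[] frags = highlighter.getBestFragments(stream, text, 10);
                for (String frag : frags)
                {
                    System.out.println("=======================");
                    System.out.println(frag);
                }
            }
        }
    }
}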
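This code works for plain-text content, but when I use it on HTML pages that contain CSS, JavaScript, HTML tags, and text content, "frag" sometimes comes back blank even though the page contains a match.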
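To help narrow this down, here is a small standard-library check (not part of my project) that reports where the first occurrence of a search prefix sits in a raw page. It is only a sketch and rests on one assumption: as far as I understand, Lucene's `Highlighter` analyzes only the first `maxDocCharsToAnalyze` characters of the text (default `Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE`, which I believe is 50 * 1024), so a page whose CSS and JS push the first match past that offset might produce no fragments. The class `MatchOffsetCheck`, its methods, and the constant `ASSUMED_ANALYZE_LIMIT` are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of Lucene or of the project code.
final class MatchOffsetCheck {

    // Assumption: mirrors Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE (50 * 1024).
    static final int ASSUMED_ANALYZE_LIMIT = 50 * 1024;

    // Returns the character offset of the first case-insensitive occurrence
    // of prefix in contents, or -1 if it does not occur. This is a crude
    // substring check: it does not tokenize, so "new" also matches "renew".
    static int firstOffset(String contents, String prefix) {
        Matcher m = Pattern
                .compile(Pattern.quote(prefix), Pattern.CASE_INSENSITIVE)
                .matcher(contents);
        return m.find() ? m.start() : -1;
    }

    // True when the first occurrence lies at or beyond the assumed limit.
    static boolean firstMatchBeyondLimit(String contents, String prefix) {
        return firstOffset(contents, prefix) >= ASSUMED_ANALYZE_LIMIT;
    }

    public static void main(String[] args) {
        // Build a synthetic page whose <style> block is larger than the limit.
        StringBuilder page = new StringBuilder("<html><head><style>");
        while (page.length() < 60_000) {
            page.append(".rule { margin: 0; }\n");
        }
        page.append("</style></head><body><p>Flights to New Delhi</p></body></html>");
        String html = page.toString();

        System.out.println("first offset: " + firstOffset(html, "new"));
        System.out.println("beyond limit: " + firstMatchBeyondLimit(html, "new"));
    }
}
```

If this reports matches beyond the limit for exactly the pages that come back blank, raising the limit with `highlighter.setMaxDocCharsToAnalyze(...)` might be worth trying. If it does not, the cause is probably elsewhere.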
Please help me resolve this, and let me know if you need any further details.
Thanks in advance.