I would like a code demo or some ideas for building a Lucene index with a Spark cluster. I have tried a few approaches, but I still do not know how to use Lucene's IndexWriter inside Spark. My input data looks like this: sellerId, productId, title
I would like the output to be Lucene index files.
Answer 0 (score: 1)
You can take a look at spark-lucenerdd. For a quick introduction to the library, see the slides.
Disclaimer: I am the author of the library.
Answer 1 (score: 0)
I actually ended up implementing the following, and it works well:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.store.LockObtainFailedException;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;
import org.springframework.beans.factory.annotation.Autowired;

// DatasetIndexService, FileUtility, SparkSchemaUtil, RowIndexWriter/LuceneRowIndexWriter,
// RowIndexSearcher/LuceneRowIndexSearcher and Filter are helper types not shown here.
public class LuceneDatasetIndexService implements DatasetIndexService, Serializable
{
    private static final long serialVersionUID = 1L;

    private static final String SCHEMA_JSON = "schema.json";

    @Autowired
    private transient FileUtility fileUtility;

    @Override
    public void indexDataset(Dataset<Row> dataset, Path indexStorePath) throws IOException
    {
        // On the first run, create the index directory and persist the dataset schema
        // next to it, so executors (and later searches) can rebuild the Spark schema.
        if (!fileUtility.exists(indexStorePath))
        {
            Files.createDirectories(indexStorePath);
            Path schemaPath = indexStorePath.resolve(SCHEMA_JSON);
            String prettyJson = dataset.schema().prettyJson();
            Files.copy(new ByteArrayInputStream(prettyJson.getBytes()), schemaPath, StandardCopyOption.REPLACE_EXISTING);
        }

        // Path is not serializable, so capture it as a String for the closure.
        String path = indexStorePath.toString();

        dataset.foreachPartition(new ForeachPartitionFunction<Row>()
        {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Iterator<Row> t) throws Exception
            {
                Path indexPath = Paths.get(path);
                StructType schema = SparkSchemaUtil.readSparkSchemaFromFile(indexPath.resolve(SCHEMA_JSON));

                // Only one writer can hold the Lucene index lock at a time, so keep
                // retrying until the previous partition's writer has released it.
                RowIndexWriter writer = null;
                while (writer == null)
                {
                    try
                    {
                        writer = new LuceneRowIndexWriter(schema, indexPath);
                    }
                    catch (LockObtainFailedException e)
                    {
                        Thread.sleep(100);
                    }
                }

                try
                {
                    while (t.hasNext())
                    {
                        writer.indexRow(t.next());
                    }
                }
                finally
                {
                    writer.close();
                }
            }
        });
    }

    @Override
    public List<Row> find(Path indexStorePath, Collection<Filter> filters, int size) throws IOException
    {
        Path schemaPath = indexStorePath.resolve(SCHEMA_JSON);
        StructType schema = SparkSchemaUtil.readSparkSchemaFromFile(schemaPath);
        RowIndexSearcher searcher = new LuceneRowIndexSearcher(schema, indexStorePath);
        return searcher.search(filters, indexStorePath, size);
    }
}
This way, no data has to be transferred between nodes to build the index, and the partitions write to the Lucene index one at a time (you cannot have more than one IndexWriter open on the same index simultaneously).
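For illustration only, here is a minimal sketch (not the actual LuceneRowIndexWriter above) of what the per-partition indexing could look like with the plain Lucene API, using the sellerId/productId/title layout from the question. The file name products.csv, the index path /tmp/product-index, the StandardAnalyzer and the field types are assumptions made for the example:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SimplePartitionIndexer
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder().appName("lucene-indexing").getOrCreate();

        // Input columns: sellerId,productId,title
        Dataset<Row> products = spark.read().option("header", "true").csv("products.csv");

        // Index directory; it must be reachable from every executor.
        String indexDir = "/tmp/product-index";

        products.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            IndexWriter writer = null;
            while (writer == null)
            {
                try
                {
                    // Only one IndexWriter may hold the lock on this directory, so wait
                    // until the previous partition's writer has released it.
                    writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)),
                            new IndexWriterConfig(new StandardAnalyzer()));
                }
                catch (LockObtainFailedException e)
                {
                    Thread.sleep(100);
                }
            }
            try
            {
                while (rows.hasNext())
                {
                    Row row = rows.next();
                    String sellerId = row.getAs("sellerId");
                    String productId = row.getAs("productId");
                    String title = row.getAs("title");

                    // Exact-match fields for the ids, full-text field for the title.
                    Document doc = new Document();
                    doc.add(new StringField("sellerId", sellerId, Field.Store.YES));
                    doc.add(new StringField("productId", productId, Field.Store.YES));
                    doc.add(new TextField("title", title, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            finally
            {
                writer.close();
            }
        });

        spark.stop();
    }
}

Like the answer above, the sketch retries on LockObtainFailedException so that partitions take turns holding the single index lock, and it assumes the index directory sits on a filesystem visible to all executors.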