I would like a code demo or some ideas for building a Lucene index with a Spark cluster. I have tried a few approaches, but I still do not know how to use Lucene's IndexWriter inside Spark. My input data looks like this: sellerId, productId, title
I would like the output to be Lucene index files.
Answer 0 (score: 1)
You can take a look at spark-lucenerdd. For a quick introduction to the library, see the slides.
Disclaimer: I am the author of the library.
Answer 1 (score: 0)
I actually ended up implementing the following, and it works well:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.store.LockObtainFailedException;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;
import org.springframework.beans.factory.annotation.Autowired;

// DatasetIndexService, FileUtility, SparkSchemaUtil, RowIndexWriter/LuceneRowIndexWriter,
// RowIndexSearcher/LuceneRowIndexSearcher and Filter are helper types not shown here.
public class LuceneDatasetIndexService implements DatasetIndexService, Serializable
{
    private static final long serialVersionUID = 1L;

    private static final String SCHEMA_JSON = "schema.json";

    @Autowired
    private transient FileUtility fileUtility;

    @Override
    public void indexDataset(Dataset<Row> dataset, Path indexStorePath) throws IOException
    {
        // On the first run, create the index directory and persist the dataset schema
        // next to it, so executors (and later searches) can rebuild the Spark schema.
        if (!fileUtility.exists(indexStorePath))
        {
            Files.createDirectories(indexStorePath);
            Path schemaPath = indexStorePath.resolve(SCHEMA_JSON);
            String prettyJson = dataset.schema().prettyJson();
            Files.copy(new ByteArrayInputStream(prettyJson.getBytes()), schemaPath, StandardCopyOption.REPLACE_EXISTING);
        }

        // Path is not serializable, so capture it as a String for the closure.
        String path = indexStorePath.toString();

        dataset.foreachPartition(new ForeachPartitionFunction<Row>()
        {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Iterator<Row> t) throws Exception
            {
                Path indexPath = Paths.get(path);
                StructType schema = SparkSchemaUtil.readSparkSchemaFromFile(indexPath.resolve(SCHEMA_JSON));

                // Only one writer can hold the Lucene index lock at a time, so keep
                // retrying until the previous partition's writer has released it.
                RowIndexWriter writer = null;
                while (writer == null)
                {
                    try
                    {
                        writer = new LuceneRowIndexWriter(schema, indexPath);
                    }
                    catch (LockObtainFailedException e)
                    {
                        Thread.sleep(100);
                    }
                }

                try
                {
                    while (t.hasNext())
                    {
                        writer.indexRow(t.next());
                    }
                }
                finally
                {
                    writer.close();
                }
            }
        });
    }

    @Override
    public List<Row> find(Path indexStorePath, Collection<Filter> filters, int size) throws IOException
    {
        Path schemaPath = indexStorePath.resolve(SCHEMA_JSON);
        StructType schema = SparkSchemaUtil.readSparkSchemaFromFile(schemaPath);
        RowIndexSearcher searcher = new LuceneRowIndexSearcher(schema, indexStorePath);
        return searcher.search(filters, indexStorePath, size);
    }
}
This way, no data has to be transferred between nodes to build the index, and the partitions write to the Lucene index one at a time (you cannot have more than one IndexWriter open on the same index simultaneously).
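For illustration only, here is a minimal sketch (not the actual LuceneRowIndexWriter above) of what the per-partition indexing could look like with the plain Lucene API, using the sellerId/productId/title layout from the question. The file name products.csv, the index path /tmp/product-index, the StandardAnalyzer and the field types are assumptions made for the example:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SimplePartitionIndexer
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder().appName("lucene-indexing").getOrCreate();

        // Input columns: sellerId,productId,title
        Dataset<Row> products = spark.read().option("header", "true").csv("products.csv");

        // Index directory; it must be reachable from every executor.
        String indexDir = "/tmp/product-index";

        products.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            IndexWriter writer = null;
            while (writer == null)
            {
                try
                {
                    // Only one IndexWriter may hold the lock on this directory, so wait
                    // until the previous partition's writer has released it.
                    writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)),
                            new IndexWriterConfig(new StandardAnalyzer()));
                }
                catch (LockObtainFailedException e)
                {
                    Thread.sleep(100);
                }
            }
            try
            {
                while (rows.hasNext())
                {
                    Row row = rows.next();
                    String sellerId = row.getAs("sellerId");
                    String productId = row.getAs("productId");
                    String title = row.getAs("title");

                    // Exact-match fields for the ids, full-text field for the title.
                    Document doc = new Document();
                    doc.add(new StringField("sellerId", sellerId, Field.Store.YES));
                    doc.add(new StringField("productId", productId, Field.Store.YES));
                    doc.add(new TextField("title", title, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            finally
            {
                writer.close();
            }
        });

        spark.stop();
    }
}

Like the answer above, the sketch retries on LockObtainFailedException so that partitions take turns holding the single index lock, and it assumes the index directory sits on a filesystem visible to all executors.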