Question

I am using Lucene.Net + a custom crawler + IFilter so that I can index the data inside blobs.

// Crawler: downloads each blob to local storage on the Azure instance,
// indexes it, then deletes the local copy.
foreach (var item in containerList)
{
    CloudBlobContainer container = BlobClient.GetContainerReference(item.Name);
    if (container.Name != "indexes") // skip the "indexes" container
    {
        IEnumerable<IListBlobItem> blobs = container.ListBlobs();
        foreach (CloudBlob blob in blobs)
        {
            string localFile = path + blob.Name;

            blob.DownloadToFile(localFile);
            indexer.IndexBlobData(path, blob);
            System.IO.File.Delete(localFile);
        }
    }
}

The following is the indexer function, which uses IFilter:

public bool IndexBlobData(string path, CloudBlob blob)
{
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    try
    {
        // FilterReader needs a physical file path; "using" closes the reader
        // even if reading or indexing throws.
        using (TextReader reader = new FilterReader(path + blob.Name))
        {
            doc.Add(new Lucene.Net.Documents.Field("url", blob.Uri.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
            doc.Add(new Lucene.Net.Documents.Field("content", reader.ReadToEnd(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED));
            indexWriter.AddDocument(doc);
        }
        return true;
    }
    catch (Exception)
    {
        // Any failure is reported to the caller as "not indexed".
        return false;
    }
}

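Now my problem is that I don't want to download the file to instance storage. I want to pass the file directly to FilterReader, but it requires a "physical" path, and passing an HTTP address does not work. Can anyone suggest a workaround? I don't want to download the same file again from the blob and then index it; instead I would prefer to download it, keep it in main memory, and use the index filter on it directly.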

I am using the IFilter from here

Answer 1

"I don't want to download same file again from blob and then index it, instead i will prefer download and keep it in main memory and directly use index filter"

It is not quite clear what you mean here. Which "main memory" is that: Azure Blob storage, or the local instance memory?

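Because of the nature of the IFilter interface, the problem you are facing cannot be solved. If you look into the source of the IFilter wrapper you are using ("from here"), you will see that it uses the IPersistFile COM interface. Unfortunately, this interface works only with local files and does not accept streams.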

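I would have suggested taking a Stream from the blob and passing that to the reader instead of a physical path. But, as already mentioned, IFilter uses a COM interface that works only with physical paths. So with your current approach there is no way to skip the blob download.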
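Since the download cannot be avoided, it can at least be made tidy. Below is a minimal sketch, not the wrapper's own API: it copies a stream to a uniquely named temporary file, hands the path to a text extractor, and always deletes the file afterwards. The names `TempFileExtraction`, `ExtractViaTempFile`, and the `extractText` delegate are hypothetical; in the question's code the delegate would wrap `new FilterReader(path)`.

```csharp
using System;
using System.IO;

public static class TempFileExtraction
{
    // Copies `source` to a uniquely named temp file, runs `extractText` on
    // that path, and deletes the file even if extraction throws.
    // `extractText` is a hypothetical stand-in for the IFilter-based reader.
    public static string ExtractViaTempFile(
        Stream source, string extension, Func<string, string> extractText)
    {
        // Keep the original extension: filters are normally selected by
        // file extension, so a bare temp name may not resolve to one.
        string tempPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        try
        {
            using (FileStream file = File.Create(tempPath))
            {
                source.CopyTo(file);
            }
            return extractText(tempPath);
        }
        finally
        {
            if (File.Exists(tempPath))
            {
                File.Delete(tempPath);
            }
        }
    }
}
```

With the storage client from the question, `blob.DownloadToFile(tempPath)` can replace the stream copy; the point of the sketch is the unique file name and the `finally` cleanup, so that a failed extraction does not leave files behind on instance storage.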

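There is nothing scary about downloading the blob locally. If the storage account is in the same affinity group as the compute instance, the download will be very fast and the traffic will be free. Given that you use the small instance size, you will have 165 GB of local storage, which is plenty. You can optimize your process by keeping track of what has and has not been indexed. You can use Azure Table storage for that: another extremely fast and cheap storage solution, well suited to storing key-value pairs of file name - etag. Then, when you enumerate the blobs, first get the blob's etag and check in the table whether it has already been indexed. Download the blob only if it has not been indexed, and afterwards add a new record to the table to mark this file as indexed.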
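The etag check described above can be sketched roughly as follows. Everything named here (`IIndexedBlobStore`, `InMemoryIndexedBlobStore`, `IncrementalIndexer`) is hypothetical, and the in-memory dictionary only stands in for the Azure Table so that the sketch runs on its own; a real implementation would back the interface with table storage.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over the "file name - etag" table.
public interface IIndexedBlobStore
{
    bool TryGetETag(string blobName, out string etag);
    void SetETag(string blobName, string etag);
}

// In-memory stand-in for the table, used only to keep the sketch runnable.
public sealed class InMemoryIndexedBlobStore : IIndexedBlobStore
{
    private readonly Dictionary<string, string> _etags =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public bool TryGetETag(string blobName, out string etag)
    {
        return _etags.TryGetValue(blobName, out etag);
    }

    public void SetETag(string blobName, string etag)
    {
        _etags[blobName] = etag;
    }
}

public static class IncrementalIndexer
{
    // Returns true if the blob was (re)indexed, false if it was skipped
    // because its etag is unchanged or because indexing failed.
    public static bool IndexIfChanged(
        IIndexedBlobStore store, string blobName, string currentETag,
        Func<bool> downloadAndIndex)
    {
        string knownETag;
        if (store.TryGetETag(blobName, out knownETag) && knownETag == currentETag)
        {
            return false; // already indexed in this version: no download
        }

        if (!downloadAndIndex())
        {
            return false; // do not record failures, so the next crawl retries
        }

        store.SetETag(blobName, currentETag);
        return true;
    }
}
```

In the question's crawler, the current etag would presumably come from `blob.Properties.ETag`, and `downloadAndIndex` would wrap the existing download, `IndexBlobData`, and delete steps. Recording the etag only after a successful index means a changed or previously failed blob is picked up again on the next run.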

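Or... just don't use IFilter. I don't see any benefit in using IFilter on Azure. An IFilter is registered only when the corresponding application is installed. For example, if you want to use an IFilter to process Office documents, you have to install Microsoft Office on the VM (which you currently cannot do, even if you own a license, because of the license mobility restrictions of MS Office). If you want an IFilter for PDF, you have to install Adobe Acrobat Reader (which you can do with a startup task). And so on: some applications you can install, and some you cannot. Your Windows Azure VM instance is a plain Windows installation with no IFilters at all. Imagine a base install of Windows Server 2008 R2 with no roles and no features added; that is your instance.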

Index data inside blobs using Lucene.NET and C#
