How can I get an IPropagatorBlock<TInput, TOutput> to stop from within itself?

Asked: 2012-11-02 20:07:56

Tags: .net task-parallel-library tpl-dataflow

Let's say I start with a TransformBlock<Uri, string> (which is itself an implementation of IPropagatorBlock<Uri, string>) that takes a Uri and fetches the content into a string (this is a web crawler):

var downloader = new TransformBlock<Uri, string>(async uri => {
    // Download and return string asynchronously...
});

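Once I have the content in a string, I parse it for links. Since a page can contain multiple links, I use a TransformManyBlock<string, Uri> to map the single result (the content) to many links: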

// The discovered item block.
var parser = new TransformManyBlock<string, Uri>(s => {
    // Parse the content here, return an IEnumerable<Uri>.
});
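The body of the parser is left out above. As a rough sketch only, it might look like the following. ExtractLinks and LinkParserSketch are hypothetical names, not part of the original code, and the sketch assumes links appear as double-quoted href attributes holding absolute http or https URLs. A regular expression is used for brevity; it is not a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkParserSketch
{
    // Hypothetical helper: matches double-quoted href attributes only.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*\"(?<url>[^\"]+)\"",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static IEnumerable<Uri> ExtractLinks(string content)
    {
        var links = new List<Uri>();

        foreach (Match match in Href.Matches(content))
        {
            // Keep only well-formed absolute http(s) URLs; relative
            // links are skipped in this sketch.
            if (Uri.TryCreate(match.Groups["url"].Value, UriKind.Absolute, out var uri) &&
                (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(uri);
            }
        }

        // An empty list is the "nothing more to parse" case.
        return links;
    }
}
```

Under those assumptions the parser block could be built as new TransformManyBlock<string, Uri>(s => LinkParserSketch.ExtractLinks(s)).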

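The key point about the parser is that it can return an empty sequence, indicating that there are no more items to parse.

However, that only holds for one branch of the tree (or one section of the web).

I then link the downloader to the parser, and the parser back to the downloader, like so: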

downloader.LinkTo(parser);
parser.LinkTo(downloader);

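Now, I know I can stop everything from outside the blocks by calling Complete on one of them, but how can I signal that it's complete from inside the blocks?

Or do I have to manage this state somehow myself?

Right now it just hangs, because the downloader block is starved once everything has been downloaded and parsed.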
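To make the cause of the hang concrete, here is a minimal sketch (CompletionSketch is a made-up name, and the timeouts are arbitrary). It shows that a block's Completion task stays pending until something calls Complete, even when the block has no work left:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public static class CompletionSketch
{
    // Returns whether Completion finished before and after Complete().
    public static (bool Before, bool After) Run()
    {
        var block = new ActionBlock<int>(n => { });

        block.Post(1);

        // Nobody has called Complete, so this wait times out even
        // though the only posted item has long been processed.
        bool before = block.Completion.Wait(TimeSpan.FromMilliseconds(500));

        // Signal that no more input is coming, then wait again.
        block.Complete();
        bool after = block.Completion.Wait(TimeSpan.FromSeconds(5));

        return (before, after);
    }
}
```

In the crawler, the two blocks feed each other, so neither one can tell on its own that the other will never send anything again.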

Here is a fully self-contained test method that hangs on the call to Wait:

[TestMethod]
public void TestSpider()
{
    // The list of numbers.
    var numbers = new[] { 1, 2 };

    // Transforms from an int to a string.
    var downloader = new TransformBlock<Tuple<int, string>, string>(
        t => t.Item2 + t.Item1.ToString(CultureInfo.InvariantCulture),

        // Let's assume four downloads to a domain at a time.
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }
    );

    // Gets the next set of strings.
    var parser = new TransformManyBlock<string, Tuple<int, string>>(s => {
        // If the string length is greater than three, return an
        // empty sequence.
        // This is the signal for this branch to stop.
        if (s.Length > 3) return Enumerable.Empty<Tuple<int, string>>();

        // Branch out.
        return numbers.Select(n => new Tuple<int, string>(n, s));
    }, 
    // These are simple transformations/parsing, no need to not parallelize.
    // The dataflow blocks will handle the task allocation.
    new ExecutionDataflowBlockOptions {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
    });

    // For broadcasting to an action.
    var parserBroadcaster = new BroadcastBlock<Tuple<int, string>>(
        // Clone.
        t => new Tuple<int, string>(t.Item1, t.Item2));

    // Indicate what was parsed.
    var parserConsumer = new ActionBlock<Tuple<int, string>>(
        t => Debug.WriteLine(
            string.Format(CultureInfo.InvariantCulture, 
                "Consumed - Item1: {0}, Item2: \"{1}\"",
            t.Item1, t.Item2)));

    // Link downloader to parser.
    downloader.LinkTo(parser);

    // Parser to broadcaster.
    parser.LinkTo(parserBroadcaster);

    // Broadcaster to consumer.
    parserBroadcaster.LinkTo(parserConsumer);

    // Broadcaster back to the downloader.
    parserBroadcaster.LinkTo(downloader);

    // Start the downloader.
    downloader.Post(new Tuple<int, string>(1, ""));

    // Wait on the consumer to finish.
    parserConsumer.Completion.Wait();
}

Its output (as expected, before it hangs) is:

Consumed - Item1: 1, Item2: "1"
Consumed - Item1: 2, Item2: "1"
Consumed - Item1: 1, Item2: "11"
Consumed - Item1: 2, Item2: "11"
Consumed - Item1: 1, Item2: "12"
Consumed - Item1: 2, Item2: "12"
Consumed - Item1: 1, Item2: "111"
Consumed - Item1: 2, Item2: "111"
Consumed - Item1: 1, Item2: "112"
Consumed - Item1: 2, Item2: "112"
Consumed - Item1: 1, Item2: "121"
Consumed - Item1: 2, Item2: "121"
Consumed - Item1: 1, Item2: "122"
Consumed - Item1: 2, Item2: "122"

1 Answer:

Answer 0 (score: 1)

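The TPL Dataflow framework doesn't have anything that handles this out of the box. It's more of a state-management problem.

That said, the key is to keep track of the URLs that have been downloaded as well as the URLs that still need to be downloaded.

The ideal place to handle this is the parser block; it is the point where you have both the content (which will be turned into more links to download) and the URL that the content was downloaded from.

Working from the sample above, a way to capture the download result along with the URI it was downloaded from needs to be introduced (I would have used a Tuple, but that would have made things too confusing):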

public class DownloadResult
{
    public Tuple<int, string> Uri { get; set; }
    public string Content { get; set; }
}

From there, the downloader block is almost the same, only updated to output the structure above:

[TestMethod]
public void TestSpider2()
{
    // The list of numbers.
    var numbers = new[] { 1, 2 };

    // Performs the downloading.
    var downloader = new TransformBlock<Tuple<int, string>, DownloadResult>(
        t => new DownloadResult { 
            Uri = t, 
            Content = t.Item2 + 
                t.Item1.ToString(CultureInfo.InvariantCulture) 
        },

        // Let's assume four downloads to a domain at a time.
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }
    );

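The consumer of the parser doesn't need to change, but it does need to be declared earlier, because the parser has to signal the consumer that it should stop consuming, and we want to capture it in the closure passed to the parser: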

// Indicate what was parsed.
var parserConsumer = new ActionBlock<Tuple<int, string>>(
    t => Debug.WriteLine(
        string.Format(CultureInfo.InvariantCulture, 
            "Consumed - Item1: {0}, Item2: \"{1}\"",
            t.Item1, t.Item2)));

Now the state manager has to be introduced:

// The set indicating what still needs to be processed.
var itemsToProcess = new HashSet<Tuple<int, string>>();

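At first, I thought about just using a ConcurrentDictionary<TKey, TValue>, but since an atomic operation has to be performed around a removal and multiple additions, it doesn't provide what's needed. A simple lock statement is the best option here.

The parser is what changes the most. It parses the items as normal, but it also does the following atomically: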

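  • Removes the URL from the state machine (itemsToProcess).
  • Adds the new URLs to the state machine.
  • If no items remain in the state machine after the above, signals the consumer block that it is done by calling the Complete method on the IDataflowBlock interface.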
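The three steps above can be sketched in isolation as a small helper. PendingItems<T> and its members are hypothetical names, not part of the original answer, which keeps the HashSet and the lock inline in the parser. The sketch also assumes, as in the sample, that each item is discovered only once; because the set deduplicates, an item queued twice while still pending would be tracked only once.

```csharp
using System.Collections.Generic;

// Hypothetical helper: wraps the set and the lock so that the
// remove/add/check sequence always runs as one atomic unit.
public sealed class PendingItems<T>
{
    private readonly HashSet<T> _items = new HashSet<T>();
    private readonly object _gate = new object();

    // Registers an item that has been queued for processing.
    public void Add(T item)
    {
        lock (_gate) { _items.Add(item); }
    }

    // Removes the finished item, adds the items it produced, and
    // returns true when nothing is left to process. Pass an already
    // materialized collection so no lazy query runs under the lock.
    public bool Finish(T finished, IEnumerable<T> discovered)
    {
        lock (_gate)
        {
            _items.Remove(finished);
            foreach (T item in discovered) _items.Add(item);
            return _items.Count == 0;
        }
    }
}
```

With a helper like this, the parser would call Finish with the URL it just handled and the materialized list of new items, and complete the consumer only when the call returns true.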

It looks like this:

// Changes content into items and new URLs to download.
var parser = new TransformManyBlock<DownloadResult, Tuple<int, string>>(
    r => {
        // The parsed items.
        IEnumerable<Tuple<int, string>> parsedItems;

        // If the string length is greater than three, return an
        // empty sequence.
        // This is the signal for this branch to stop.
        parsedItems = (r.Uri.Item2.Length > 3) ? 
            Enumerable.Empty<Tuple<int, string>>() :
            numbers.Select(n => new Tuple<int, string>(n, r.Content));

        // Materialize the list.
        IList<Tuple<int, string>> materializedParsedItems = 
            parsedItems.ToList();

        // Lock here; the removal from the set of items to
        // process and the addition of the new items must
        // be atomic.
        lock (itemsToProcess)
        {
            // Remove the item.
            itemsToProcess.Remove(r.Uri);

            // If the materialized list has zero items, and the new
            // list has zero items, finish the action block.
            if (materializedParsedItems.Count == 0 && 
                itemsToProcess.Count == 0)
            {
                // Complete the consumer block.
                parserConsumer.Complete();
            }

            // Add the items.
            foreach (Tuple<int, string> newItem in materializedParsedItems)
                itemsToProcess.Add(newItem);

            // Return the items.
            return materializedParsedItems;
        }
    },

    // These are simple transformations/parsing, no need to not
    // parallelize.  The dataflow blocks will handle the task
    // allocation.
    new ExecutionDataflowBlockOptions {
        MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
    });

The broadcaster and the links are the same:

// For broadcasting to an action.
var parserBroadcaster = new BroadcastBlock<Tuple<int, string>>(
    // Clone.
    t => new Tuple<int, string>(t.Item1, t.Item2));

// Link downloader to parser.
downloader.LinkTo(parser);

// Parser to broadcaster.
parser.LinkTo(parserBroadcaster);

// Broadcaster to consumer.
parserBroadcaster.LinkTo(parserConsumer);

// Broadcaster back to the downloader.
parserBroadcaster.LinkTo(downloader);

When starting the blocks, the state machine has to be primed with the URL to download before the root is passed to the Post method:

// The initial post to download.
var root = new Tuple<int, string>(1, "");

// Add to the items to process.
itemsToProcess.Add(root);

// Post to the downloader.
downloader.Post(root);

The call to the Wait method on the Task class is the same, and it now completes without hanging:

    // Wait on the consumer to finish.
    parserConsumer.Completion.Wait();
}