Question

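I have some code that parses a CSV file. It is meant to batch up tasks and process records. I have some sort of concurrency issue: a null string is being passed to SplitCsvLineToCells. The crazy thing is that when I move up the call stack in Visual Studio, I can see the array index/string being passed in, and it is not null! Is it possible that something is being garbage collected and I am losing the string reference? I'm drawing a bit of a blank right now. None of the strings passed in should be null (based on the input, the strings are definitely all populated). Here is the code: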

    static SemaphoreSlim semaphore;
    static MemoryStream outputStream;
    static StreamWriter writer;
    static StreamReader reader;
    static string[] headers;
    static int readCount = 0;
    static int BATCH_SIZE = 25;

    public static void Main(string[] args)
    {
        var path = args[0];
        semaphore = new SemaphoreSlim(Environment.ProcessorCount);

        var csv = File.Open(path, FileMode.Open, FileAccess.Read);
        outputStream = new MemoryStream();
        writer = new StreamWriter(outputStream, Encoding.UTF8);
        reader = new StreamReader(csv, Encoding.UTF8);
        headers = SplitCsvLineToCells(reader.ReadLine());
        List<Task> tasks = new List<Task>();

        var lines = new string[BATCH_SIZE];
        var currentIndex = 0;
        while (!reader.EndOfStream)
        {
            lines[currentIndex] = reader.ReadLine();
            currentIndex++;
            readCount++;

            if (readCount % BATCH_SIZE == 0)
            {
                semaphore.Wait();
                var task = new Task(() => ProcessRecords(lines));
                task.Start();
                tasks.Add(task);
                lines = new string[BATCH_SIZE];
                currentIndex = 0;
            }
        }

        Task.WaitAll(tasks.ToArray());
        Console.WriteLine("complete.");
    }

    static void ProcessRecords(string[] lines)
    {
        try
        {
            var uploads = new List<Dictionary<string, AttributeValue>>();
            for (var i = 0; i < lines.Length; i++)
            {
                string[] parsedLine = SplitCsvLineToCells(lines[i]); // in the debugger when moving up the call stack, lines[i] is not null.
                var outputObject = new Dictionary<string, AttributeValue>();
                for (var j = 0; j < headers.Length && j < parsedLine.Length; j++)
                {
                    if (!string.IsNullOrEmpty(parsedLine[j]))
                        outputObject.Add(headers[j], new AttributeValue() { S = parsedLine[j] == "" ? null : parsedLine[j] });
                }
                uploads.Add(outputObject);
            }
            // GO DO MORE STUFF HERE
        }
        catch (System.IO.IOException ex)
        {
            Console.WriteLine("Processing failed: {0}", ex.Message);
        }
        finally
        {
            semaphore.Release();
        }
    }

    static string[] SplitCsvLineToCells(string line, char delimeter = ',') 
    {
        // the line in here shows as null
        // but in the debugger the calling function string isn't null
    }

Answer 1

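The way you want to solve this is to create a small number of persistent tasks that read from a single queue designed for multiple producers and consumers. Something like this: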

    // Requires: using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;

    // The queue, initialized with a maximum capacity of 25 lines.
    // Increase or decrease depending on your needs.
    private static readonly BlockingCollection<string> linesQueue = new BlockingCollection<string>(25);

    // in your Main

    // Start the consumers before producing, so the bounded queue can drain.
    var task1 = Task.Factory.StartNew(ProcessRecords, TaskCreationOptions.LongRunning);
    var task2 = Task.Factory.StartNew(ProcessRecords, TaskCreationOptions.LongRunning);

    // The producer reads lines and adds them to the queue.
    // inputFilename is the path to the CSV file (args[0] in the question).
    foreach (var line in File.ReadLines(inputFilename))
    {
        linesQueue.Add(line); // blocks while the queue is full
    }

    // Tell the queue that no more data is forthcoming.
    linesQueue.CompleteAdding();

    // wait for the consumers to complete
    task1.Wait();
    task2.Wait();

    // and your ProcessRecords method
    static void ProcessRecords()
    {
        // do whatever initialization you want here
        // The loop ends once CompleteAdding has been called and the queue is empty.
        foreach (var line in linesQueue.GetConsumingEnumerable())
        {
            // split the line and do what you want with the result
        }
    }

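This construct is simple, proven, and effective. It is also flexible, in that you can have as many producers and consumers as you like. And it uses persistent threads rather than spawning a new thread for every small batch, which would result in poor performance due to thread startup overhead.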
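To make the "as many consumers as you like" point concrete, here is a minimal self-contained sketch of the same pattern with a configurable number of consumers. The class `QueueSketch`, the method `ProcessAll`, and its parameters are hypothetical names chosen for illustration; only `BlockingCollection<T>`, `GetConsumingEnumerable`, `CompleteAdding`, and `Task.Factory.StartNew` come from the framework. It does no error handling, so treat it as a starting point rather than a finished implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class QueueSketch
{
    // Hypothetical helper: feeds `lines` through a bounded queue to
    // `consumerCount` long-running consumers, calls `handleLine` on each
    // line, and returns the number of lines handled.
    public static int ProcessAll(IEnumerable<string> lines, int consumerCount, Action<string> handleLine)
    {
        var processed = 0;
        using (var queue = new BlockingCollection<string>(25))
        {
            // Start the consumers first so the bounded queue can drain.
            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    // Ends once CompleteAdding was called and the queue is empty.
                    foreach (var line in queue.GetConsumingEnumerable())
                    {
                        handleLine(line);
                        Interlocked.Increment(ref processed);
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            // Producer: Add blocks while the queue is at capacity.
            foreach (var line in lines)
            {
                queue.Add(line);
            }
            queue.CompleteAdding();

            Task.WaitAll(consumers);
        }
        return processed;
    }
}
```

Calling `QueueSketch.ProcessAll(File.ReadLines(path), Environment.ProcessorCount, line => { /* split and process */ })` would be one way to use it. Because each line is handed over through the queue, no lambda captures a local variable that the reading loop later reassigns, which is what happens with `lines` in the question's code.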

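If your processing threads need to output to a common location, you can create a separate output queue (another BlockingCollection) that the threads write to, and another persistent task that reads that queue and writes the data to a file.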
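A rough sketch of that two-queue arrangement might look like the following. The names `OutputQueueSketch`, `Run`, and `transform` are assumptions made up for this example, and a `TextWriter` stands in for the output file. Note that with more than one worker the output order is not guaranteed to match the input order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class OutputQueueSketch
{
    // Hypothetical helper: workers read from an input queue, apply
    // `transform`, and push results to an output queue; a single writer
    // task is the only code that touches `output`.
    public static void Run(IEnumerable<string> inputLines, TextWriter output,
                           int workerCount, Func<string, string> transform)
    {
        using (var input = new BlockingCollection<string>(25))
        using (var results = new BlockingCollection<string>(25))
        {
            // The single writer: no locking needed around `output`.
            var writerTask = Task.Factory.StartNew(() =>
            {
                foreach (var result in results.GetConsumingEnumerable())
                {
                    output.WriteLine(result);
                }
            }, TaskCreationOptions.LongRunning);

            // The processing tasks.
            var workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (var line in input.GetConsumingEnumerable())
                    {
                        results.Add(transform(line));
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            // The producer.
            foreach (var line in inputLines)
            {
                input.Add(line);
            }
            input.CompleteAdding();

            // Close the output queue only after every worker has finished.
            Task.WaitAll(workers);
            results.CompleteAdding();
            writerTask.Wait();
        }
    }
}
```

Funnelling all writes through one task means the `StreamWriter` from the question would never be shared between threads.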

For details on how all this works, see my blog post Simple multithreading, Part 2.
