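I have some code that parses a CSV file. It is meant to batch up tasks and process the records. I have some kind of concurrency problem: a null string is being passed to SplitCsvLineToCells. The crazy thing is that when I move up the call stack in Visual Studio, I can see the array index/string being passed in, and it is not null! Is it possible that the string is being garbage collected and I am losing the reference? I'm drawing a bit of a blank right now. None of the strings passed in should ever be null (based on the input, the strings are definitely all populated). Here is the code: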
static SemaphoreSlim semaphore;
static MemoryStream outputStream;
static StreamWriter writer;
static StreamReader reader;
static string[] headers;
static int readCount = 0;
static int BATCH_SIZE = 25;
public static void Main(string[] args)
{
    var path = args[0];
    semaphore = new SemaphoreSlim(Environment.ProcessorCount);
    var csv = File.Open(path, FileMode.Open, FileAccess.Read);
    outputStream = new MemoryStream();
    writer = new StreamWriter(outputStream, Encoding.UTF8);
    reader = new StreamReader(csv, Encoding.UTF8);
    headers = SplitCsvLineToCells(reader.ReadLine());
    List<Task> tasks = new List<Task>();
    var lines = new string[BATCH_SIZE];
    var currentIndex = 0;
    while (!reader.EndOfStream)
    {
        lines[currentIndex] = reader.ReadLine();
        currentIndex++;
        readCount++;
        if (readCount % BATCH_SIZE == 0)
        {
            semaphore.Wait();
            var task = new Task(() => ProcessRecords(lines));
            task.Start();
            tasks.Add(task);
            lines = new string[BATCH_SIZE];
            currentIndex = 0;
        }
    }
    Task.WaitAll(tasks.ToArray());
    Console.WriteLine("complete.");
}
static void ProcessRecords(string[] lines)
{
    try
    {
        var uploads = new List<Dictionary<string, AttributeValue>>();
        for (var i = 0; i < lines.Length; i++)
        {
            string[] parsedLine = SplitCsvLineToCells(lines[i]); // in the debugger when moving up the call stack, lines[i] is not null.
            var outputObject = new Dictionary<string, AttributeValue>();
            for (var j = 0; j < headers.Length && j < parsedLine.Length; j++)
            {
                if (!string.IsNullOrEmpty(parsedLine[j]))
                    outputObject.Add(headers[j], new AttributeValue() { S = parsedLine[j] == "" ? null : parsedLine[j] });
            }
            uploads.Add(outputObject);
        }
        // GO DO MORE STUFF HERE
    }
    catch (System.IO.IOException ex)
    {
        Console.WriteLine("Processing failed: {0}", ex.Message);
    }
    finally
    {
        semaphore.Release();
    }
}
static string[] SplitCsvLineToCells(string line, char delimeter = ',')
{
    // the line in here shows as null
    // but in the debugger the calling function string isn't null
}
Answer 0 (score: 2)
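The way you want to solve this is to create a small number of persistent tasks that read from a single queue designed for multiple producers and consumers. Something like this: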
// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;

// The queue, initialized with a maximum capacity of 25 lines.
// Increase or decrease depending on your needs.
// Static so that the static Main and ProcessRecords can both reach it.
private static BlockingCollection<string> linesQueue = new BlockingCollection<string>(25);

// in your Main
// Start two long-running consumers. Pass the method itself so that it is actually invoked.
var task1 = Task.Factory.StartNew(ProcessRecords, TaskCreationOptions.LongRunning);
var task2 = Task.Factory.StartNew(ProcessRecords, TaskCreationOptions.LongRunning);

// The producer reads lines and adds them to the queue.
// Add blocks while the queue is full, so the reader cannot run far ahead.
foreach (var line in File.ReadLines(inputFilename))
{
    linesQueue.Add(line);
}

// Tell the queue that no more data is forthcoming.
linesQueue.CompleteAdding();

// wait for the consumers to complete
task1.Wait();
task2.Wait();

// and your ProcessRecords method
static void ProcessRecords()
{
    // do whatever initialization you want here
    // The loop ends once the queue is empty and CompleteAdding has been called.
    foreach (var line in linesQueue.GetConsumingEnumerable())
    {
        // split the line and do what you want with the result
    }
}
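This structure is simple, proven, and efficient. It is also flexible: you can have as many producers and consumers as you like. It uses persistent threads rather than spawning a new thread for every small batch, which would hurt performance because of thread startup overhead.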
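If the processing threads need to output to a common location, you can create a separate output queue (another BlockingCollection) that the threads write to, plus another persistent task that reads that queue and writes the data to a file.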
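A minimal sketch of that two-queue arrangement, under stated assumptions: the names PipelineSketch, Run, Transform and WriteOutput are hypothetical, the capacity of 25 is arbitrary, and Transform merely upper-cases the line as a stand-in for the real CSV processing. With more than one consumer, the order of the output lines is not guaranteed to match the input.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class PipelineSketch
{
    public static void Run(string inputFilename, string outputFilename, int consumerCount = 2)
    {
        using (var linesQueue = new BlockingCollection<string>(25))
        using (var outputQueue = new BlockingCollection<string>(25))
        {
            // Persistent consumers: read input lines, push results to the output queue.
            var consumers = new Task[consumerCount];
            for (var i = 0; i < consumerCount; i++)
            {
                consumers[i] = Task.Factory.StartNew(
                    () => ProcessRecords(linesQueue, outputQueue),
                    TaskCreationOptions.LongRunning);
            }

            // A single persistent writer owns the output file.
            var writerTask = Task.Factory.StartNew(
                () => WriteOutput(outputQueue, outputFilename),
                TaskCreationOptions.LongRunning);

            // Producer: blocks whenever the input queue is full.
            foreach (var line in File.ReadLines(inputFilename))
            {
                linesQueue.Add(line);
            }
            linesQueue.CompleteAdding();

            // Only after every consumer has finished is the output complete.
            Task.WaitAll(consumers);
            outputQueue.CompleteAdding();
            writerTask.Wait();
        }
    }

    static void ProcessRecords(BlockingCollection<string> input, BlockingCollection<string> output)
    {
        foreach (var line in input.GetConsumingEnumerable())
        {
            output.Add(Transform(line));
        }
    }

    // Hypothetical placeholder for the real per-line work.
    static string Transform(string line)
    {
        return line.ToUpperInvariant();
    }

    static void WriteOutput(BlockingCollection<string> output, string path)
    {
        using (var writer = new StreamWriter(path))
        {
            foreach (var result in output.GetConsumingEnumerable())
            {
                writer.WriteLine(result);
            }
        }
    }
}
```

Calling PipelineSketch.Run(inputPath, outputPath) processes the whole file. Because only the writer task touches the StreamWriter, no locking around the file is needed.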
For details on how all of this works, see my blog post Simple multithreading, Part 2.
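As for the null itself, a likely cause (an assumption, since the question does not confirm it) is that the lambda in Main captures the lines variable rather than the array it referred to when the task was created. If the task only begins running after lines = new string[BATCH_SIZE] has executed, it reads the new, still-empty array. The debugger can later show non-null values because the main thread has refilled that array in the meantime. Here is a small sketch of the effect, with hypothetical names CaptureSketch, shared and copied, and with the tasks deliberately started after the reassignment so that the outcome is deterministic:

```csharp
using System;
using System.Threading.Tasks;

static class CaptureSketch
{
    public static string[] Run()
    {
        var lines = new[] { "a,b,c" };

        // Captures the variable 'lines'; sees whatever array it refers to when the task runs.
        var shared = new Task<string>(() => lines[0]);

        // Copy the reference into a variable that is never reassigned.
        var batch = lines;
        var copied = new Task<string>(() => batch[0]);

        // Same reassignment as in the question's loop.
        lines = new string[1];

        shared.Start();
        copied.Start();

        // shared reads the new, empty array; copied still reads the original batch.
        return new[] { shared.Result, copied.Result };
    }
}
```

Here the first element of the returned array is null and the second is "a,b,c". Under this assumption, copying the array reference into a per-batch local before creating the task would avoid the null in the original code, although the queue-based design above removes the problem entirely.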