Question

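I am trying to parse a 300 MB CSV file and save it to MongoDB. To do that, I need to convert the CSV file into a list of BsonDocument, each holding the key-value pairs that make up a document. Every row in the CSV file becomes a new BsonDocument. When I run tests in parallel, every few minutes I get an OOM exception on the Split operation. I have read this very interesting article, but I could not find any practical solution that I can apply to files this large.

I have been looking at different CSV helpers, but could not find anything that solves this problem.

Any help is much appreciated.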

Answer 1

You should be able to read the file line by line, like this:

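using System;
using System.IO;

public static class Program
{
    public static void Main(string[] args)
    {
        string path = args[0]; // path to the CSV file

        using (StreamReader sr = new StreamReader(path))
        {
            string[] headers = null;
            string rawLine;

            // ReadLine() returns null at end of file, so check it before calling Split.
            while ((rawLine = sr.ReadLine()) != null)
            {
                string[] curLine = rawLine.Split(',');

                if (headers == null)
                {
                    headers = curLine; // the first line holds the column names
                }
                else
                {
                    processLine(headers, curLine);
                }
            }
        }
    }

    public static void processLine(string[] headers, string[] line)
    {
        // Stop at the shorter array in case a row has fewer fields than the header.
        for (int i = 0; i < headers.Length && i < line.Length; i++)
        {
            string header = headers[i];
            string value = line[i];

            //Now you have individual header/value pairs that you can put into mongodb
        }
    }
}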

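I have never used MongoDB, and I don't know the structure of your CSV or of your Mongo database, so I can't help much there. Hopefully you can take it from here. If not, please edit your post with more details about how your MongoDB is structured, and hopefully someone will post a more helpful answer.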
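
The line-by-line approach above can be combined with batching so that only one small group of rows is in memory at any time. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names CsvStreaming, ReadRows and Batch are hypothetical, and the parser assumes a simple CSV with no quoted fields and no delimiters inside values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvStreaming
{
    // Lazily yields one header->value dictionary per data row.
    // Assumption: simple CSV, no quoted fields, no delimiters inside values.
    public static IEnumerable<Dictionary<string, string>> ReadRows(TextReader reader, char delimiter = ',')
    {
        string headerLine = reader.ReadLine();
        if (headerLine == null) yield break; // empty input

        string[] headers = headerLine.Split(delimiter);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(delimiter);
            var row = new Dictionary<string, string>();

            // Stop at the shorter array in case a row is missing trailing fields.
            for (int i = 0; i < headers.Length && i < fields.Length; i++)
            {
                row[headers[i]] = fields[i];
            }
            yield return row;
        }
    }

    // Groups items into lists of at most batchSize elements.
    // A fresh list is created per batch, so the caller may keep or discard each one.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch; // final partial batch
    }
}
```

A caller would wrap a StreamReader over the file, pass it to ReadRows, and iterate over Batch(rows, 50), converting each batch to documents and inserting it before the next batch is read. Because both methods are iterators, nothing is read from the file until the caller asks for the next batch.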

Answer 2

Thanks @dbc, that worked! @ashbygeek, I needed to add this to your code:

 while (!sr.EndOfStream && (curLine = sr.ReadLine().Split('\t')) != null)
 {
     //do process
 }

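So I am posting my code. I fetch my large CSV file from an Azure blob and insert into MongoDB in batches rather than one document at a time. I also created my own primary-key hash and a unique index to detect duplicate documents; if a batch insert hits one, I start inserting that batch's documents one by one to identify the duplicate.

I hope this helps someone in the future.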

        using (TextFieldParser parser = new TextFieldParser(blockBlob2.OpenRead()))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            bool headerWritten = false;
            List<BsonDocument> listToInsert = new List<BsonDocument>();
            int chunkSize = 50;
            int counter = 0;
            var headers = new string[0];

            while (!parser.EndOfData)
            {
                //Processing row
                var fields = parser.ReadFields();

                if (!headerWritten)
                {
                    headers = fields;
                    headerWritten = true;
                    continue;
                }

                listToInsert.Add(new BsonDocument(headers.Zip(fields, (k, v) => new { k, v }).ToDictionary(x => x.k, x => x.v)));
                counter++;

                if (counter != chunkSize) continue;
                AdditionalInformation(listToInsert, dataCollectionQueueMessage);
                CalculateHashForPrimaryKey(listToInsert);
                await InsertDataIntoDB(listToInsert, dataCollectionQueueMessage);
                counter = 0;
                listToInsert.Clear();
            }

            if (listToInsert.Count > 0)
            {
                AdditionalInformation(listToInsert, dataCollectionQueueMessage);
                CalculateHashForPrimaryKey(listToInsert);
                await InsertDataIntoDB(listToInsert, dataCollectionQueueMessage);
            }
        }



    private async Task InsertDataIntoDB(List<BsonDocument> listToInsert, DataCollectionQueueMessage dataCollectionQueueMessage)
    {
        const string connectionString = "mongodb://127.0.0.1/localdb";

        var client = new MongoClient(connectionString);

        _database = client.GetDatabase("localdb");

        var collection = _database.GetCollection<BsonDocument>(dataCollectionQueueMessage.CollectionTypeEnum.ToString());

        await collection.Indexes.CreateOneAsync(new BsonDocument("HashMultipleKey", 1), new CreateIndexOptions() { Unique = true, Sparse = true, });

        try
        {
            await collection.InsertManyAsync(listToInsert);
        }
        catch (Exception ex)
        {
            ApplicationInsights.Instance.TrackException(ex);

            await InsertSingleDocuments(listToInsert, collection, dataCollectionQueueMessage);
        }
    }



    private async Task InsertSingleDocuments(List<BsonDocument> dataCollectionDict, IMongoCollection<BsonDocument> collection,
        DataCollectionQueueMessage dataCollectionQueueMessage)
    {
        ApplicationInsights.Instance.TrackEvent("About to start insert individual documents and to find the duplicate one");

        foreach (var data in dataCollectionDict)
        {
            try
            {
                await collection.InsertOneAsync(data);
            }
            catch (Exception ex)
            {
                ApplicationInsights.Instance.TrackException(ex,new Dictionary<string, string>() {
                    {
                        "Error Message","Duplicate document was detected, therefore ignoring this document and continuing to insert the next document"
                    }, {
                        "FilePath",dataCollectionQueueMessage.FilePath
                    }}
                );
            }
        }
    }
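
The code above calls CalculateHashForPrimaryKey without showing it. As a rough sketch of what such a helper might do, assuming the hash is built from a chosen set of key columns and then stored in the HashMultipleKey field covered by the unique index: the class name RowHash, the choice of SHA-256, and the keyColumns parameter are all assumptions for illustration, not the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class RowHash
{
    // Hypothetical helper: hashes the values of the chosen key columns
    // into a 64-character uppercase hex string (SHA-256).
    public static string Compute(IDictionary<string, string> row, IEnumerable<string> keyColumns)
    {
        var sb = new StringBuilder();
        foreach (string column in keyColumns)
        {
            string value;
            row.TryGetValue(column, out value);

            // A missing column is treated the same as an empty value.
            string v = value ?? string.Empty;

            // Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
            sb.Append(v.Length).Append(':').Append(v).Append('|');
        }

        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            return BitConverter.ToString(digest).Replace("-", string.Empty);
        }
    }
}
```

Two rows with the same values in the key columns produce the same hash, so the second insert would violate the unique index and surface as an exception, which is what triggers the one-by-one fallback in InsertSingleDocuments.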

How to convert a large CSV file to JSON in C# without using Split (Out Of Memory problem)
