Question

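I have nine 2 GB files, each containing roughly 12M string records, and I want to insert every record as a document into a local MongoDB instance (on Windows).

Right now I read each file line by line and insert every second line (the other line of each pair is an unneeded header), like this: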

bool readingflag = false;
foreach (var line in File.ReadLines(file))
{
    if (readingflag)
    {
        String document = "{'read':'" + line + "'}";
        var documnt = new BsonDocument(
             MongoDB
             .Bson
             .Serialization
             .BsonSerializer
             .Deserialize<BsonDocument>(document));

        await collection.InsertOneAsync(documnt);
        readingflag = false;
    }
    else
    {
        readingflag = true;
    }
}

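This approach works, but not as fast as I expected. I am now halfway through the first file, and I estimate that a single file will take about 4 hours (roughly 40 hours for all of the data).

I think my bottleneck is the file reading, but because the file is so large, VS won't let me load it into memory (out-of-memory exception).

Is there another approach I'm missing here?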

Answer 1

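I think we can make use of the following:

Collect a number of lines and insert them in bulk with InsertMany
Insert the data on a separate thread, since we don't need to wait for each insert to finish
Use a typed class, TextData, so that serialization is pushed to the other thread

You can experiment with the batch limit, since the best value depends on how much data is read from the file.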

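using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class TextData
{
    public ObjectId _id { get; set; }
    public string read { get; set; }
}

public class Processor
{
    // Returns Task instead of async void so the caller can await it and see errors.
    public async Task ProcessData()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var database = client.GetDatabase("test");

        var collection = database.GetCollection<TextData>("Yogevnn");
        var readingflag = false;
        var listOfDocument = new List<TextData>();
        var pendingInserts = new List<Task>();
        var limitAtOnce = 100;
        var current = 0;

        foreach (var line in File.ReadLines(@"E:\file.txt"))
        {
            if (readingflag)
            {
                var dataToInsert = new TextData
                {
                    read = line
                };
                listOfDocument.Add(dataToInsert);
                readingflag = false;
                Console.WriteLine($"Current position: {current}");

                if (++current == limitAtOnce)
                {
                    current = 0;
                    Console.WriteLine("Inserting data");
                    // Capture the full batch; listOfDocument is replaced below.
                    var listToInsert = listOfDocument;

                    // Run the insert on a thread-pool thread and keep the task,
                    // so reading continues while the batch is serialized and sent.
                    pendingInserts.Add(Task.Run(async () =>
                    {
                        Console.WriteLine("Inserting data START");
                        await collection.InsertManyAsync(listToInsert);
                        Console.WriteLine("Inserting data FINISH");
                    }));
                    listOfDocument = new List<TextData>();
                }
            }
            else
            {
                readingflag = true;
            }
        }

        // insert remainder (skipped when empty, because InsertManyAsync rejects an empty list)
        if (listOfDocument.Count > 0)
        {
            pendingInserts.Add(collection.InsertManyAsync(listOfDocument));
        }

        // Wait for all batches still in flight so that no insert is lost or fails silently.
        await Task.WhenAll(pendingInserts);
    }
}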
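
The loop in this answer starts a new insert for every batch without limiting how many run at once, so a fast reader can queue far more batches than the server is handling. The following is a minimal sketch of one way to cap that, not part of the original answer. `BoundedBatchInserter`, `RunAsync`, `insertBatch` and `maxInFlight` are hypothetical names, and the insert is passed in as a delegate so the batching logic runs without a MongoDB server.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedBatchInserter
{
    // Groups records into batches of batchSize and calls insertBatch for each one,
    // allowing at most maxInFlight batches to be running at the same time.
    // Returns the number of batches that were sent.
    public static async Task<int> RunAsync(
        IEnumerable<string> records,
        Func<IReadOnlyList<string>, Task> insertBatch,
        int batchSize = 100,
        int maxInFlight = 4)
    {
        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var pending = new List<Task>();
            var batch = new List<string>(batchSize);
            int batches = 0;

            // Sends one batch and frees its slot even if the insert fails.
            async Task SendAsync(List<string> toSend)
            {
                try { await insertBatch(toSend); }
                finally { gate.Release(); }
            }

            foreach (var record in records)
            {
                batch.Add(record);
                if (batch.Count == batchSize)
                {
                    // Pauses reading while maxInFlight batches are still pending.
                    await gate.WaitAsync();
                    pending.Add(SendAsync(batch));
                    batch = new List<string>(batchSize);
                    batches++;
                }
            }

            // Send the final partial batch, if any.
            if (batch.Count > 0)
            {
                await gate.WaitAsync();
                pending.Add(SendAsync(batch));
                batches++;
            }

            await Task.WhenAll(pending);
            return batches;
        }
    }
}
```

With the driver from the answer, `insertBatch` could plausibly be something like `b => collection.InsertManyAsync(b.Select(r => new TextData { read = r }))`. Good values for `batchSize` and `maxInFlight` are an assumption here and would need measuring against the actual data.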

Any comments are welcome!

Answer 2

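In my experiments, I found Parallel.ForEach(File.ReadLines("path")) to be the fastest. The file size was about 42 GB. I also tried batching sets of 100 lines and saving each batch, but that was slower than Parallel.ForEach.

Another example: Read large txt file multithreaded?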
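
A minimal sketch of what the Parallel.ForEach approach might look like for the file layout in the question. This is an illustration under assumptions, not the code that was measured: `ParallelLoader`, `Load`, `insertOne` and `maxParallelism` are hypothetical names, and the header lines are filtered out by index before the parallel loop, because a shared on/off flag would not be safe across threads.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelLoader
{
    // Keeps every second line (index 1, 3, 5, ...), matching the header/record
    // pairs in the question, and passes each kept line to insertOne in parallel.
    // Returns the number of lines that were handed to insertOne.
    public static int Load(
        IEnumerable<string> lines,
        Action<string> insertOne,
        int maxParallelism = 4)
    {
        // Lazy filter: the file is still streamed, never loaded whole.
        var records = lines.Where((line, index) => index % 2 == 1);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };
        int inserted = 0;

        Parallel.ForEach(records, options, line =>
        {
            insertOne(line);
            // The counter is shared across threads, so update it atomically.
            Interlocked.Increment(ref inserted);
        });

        return inserted;
    }
}
```

Against a real collection this might be called as `ParallelLoader.Load(File.ReadLines(@"E:\file.txt"), line => collection.InsertOne(new TextData { read = line }))`, reusing the TextData class from Answer 1. The order in which documents arrive is not preserved.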

Inserting huge files (2 GB) into MongoDB
