Question

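My task is to extract a few hundred thousand rows from a CSV file, namely the rows that contain a specified ID. I have about 300,000 IDs stored in a list of strings, and I need to extract every row in the CSV that contains any of them. At the moment I use a LINQ statement to check whether each row contains any ID in the list: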

using (StreamReader sr = new StreamReader(csvFile))
{
    string inLine;
    // Read each line once and test that same line against the ID list.
    while ((inLine = sr.ReadLine()) != null)
    {
        if (searchStrings.Any(inLine.Contains))
        {
            streamWriter.WriteLine(inLine);
        }
    }
}

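This works fine, but it is very slow, because the searchStrings list holds 300,000 values and the CSV I need to search has several million rows.

Does anyone know how to make this search more efficient to speed it up? Or an alternative way to extract the required rows?

Thanks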

Answer 1

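I faced a similar problem before, where I had to iterate over a few hundred thousand lines of a .csv and parse every line.

I took a threaded approach, trying to read and parse at the same time, in batches. Roughly, this is how I did it: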

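    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading;

    private static ConcurrentBag<String> items = new ConcurrentBag<String>();
    private static List<String> searchStrings;

    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        using (StreamReader sr = new StreamReader(csvFile))
        {
            const int buffer_size = 10000;
            string[] buffer = new string[buffer_size];

            int count = 0;
            String line = null;
            while ((line = sr.ReadLine()) != null)
            {
                buffer[count] = line;
                count++;
                if (count == buffer_size)
                {
                    // Copy the reference so the lambda does not see the next batch's array.
                    string[] batch = buffer;
                    var thread = new Thread(() => find(batch));
                    threads.Add(thread);
                    thread.Start();

                    buffer = new String[buffer_size];
                    count = 0;
                }
            }

            if (count > 0)
            {
                // The last batch is only partly filled; find() skips the null slots.
                find(buffer);
            }
        }

        // Wait for every worker thread to finish before writing the results.
        foreach (var thread in threads)
            thread.Join();

        // Note: a ConcurrentBag does not preserve the original line order.
        foreach (var str in items)
            streamWriter.WriteLine(str);
    }

    private static void find(string[] buffer)
    {
        // Run the search on the batch and add matching lines to the ConcurrentBag.
        foreach (var line in buffer)
        {
            if (line != null && searchStrings.Any(line.Contains))
                items.Add(line);
        }
    }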

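I threw this code together quickly from memory, so it may not be exactly right. Doing it this way definitely speeds things up (at least with very large files).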

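The idea is to keep reading from the HDD at all times. String parsing can be quite expensive, so distributing that work across multiple cores can make it significantly faster.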
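The same read-while-searching idea can also be expressed with PLINQ instead of hand-managed threads. This is only a minimal sketch, not the code I actually ran: ParallelFilter, MatchingLines and FilterFile are hypothetical names, and it assumes the plain substring check from the question is what you want.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ParallelFilter
{
    // Returns the lines that contain at least one of the search strings.
    // AsOrdered() keeps the output in input order; drop it if order does not matter.
    public static IEnumerable<string> MatchingLines(
        IEnumerable<string> lines, IReadOnlyList<string> searchStrings)
    {
        return lines
            .AsParallel()
            .AsOrdered()
            .Where(line => searchStrings.Any(s => line.Contains(s)));
    }

    // Hypothetical entry point: streams csvPath and writes the matches to outputPath.
    public static void FilterFile(string csvPath, string outputPath,
                                  IReadOnlyList<string> searchStrings)
    {
        // File.ReadLines is lazy, so reading continues while earlier lines are searched.
        File.WriteAllLines(outputPath,
            MatchingLines(File.ReadLines(csvPath), searchStrings));
    }
}
```

PLINQ picks the degree of parallelism and the partitioning itself, so there is no buffer size to tune and no thread list to join.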

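With this, I was able to parse about 250k lines in roughly 7 s. Parsing here meant splitting each line into about 50 items, parsing key/value pairs and building objects in memory, which was by far the most time-consuming part.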
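Threads spread the work across cores, but every line is still compared against every search string. If the ID happens to sit in a known column (an assumption, the question does not say so), a hash set lookup on that one field avoids the 300,000 substring checks per line. A rough sketch under that assumption follows; IdFilter, Matches and idColumn are hypothetical names, and the naive Split(',') does not handle quoted fields that contain commas.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IdFilter
{
    // True if the field at idColumn (0-based) is one of the wanted IDs.
    // Assumes plain comma-separated fields without quoted commas.
    public static bool Matches(string line, ISet<string> ids, int idColumn)
    {
        string[] fields = line.Split(',');
        return idColumn < fields.Length && ids.Contains(fields[idColumn].Trim());
    }

    // Hypothetical entry point: copies the rows whose ID field is in searchStrings.
    public static void FilterFile(string csvPath, string outputPath,
                                  IEnumerable<string> searchStrings, int idColumn)
    {
        // Build the set once; each lookup is then a hash probe, not a list scan.
        var ids = new HashSet<string>(searchStrings, StringComparer.Ordinal);
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(csvPath))
            {
                if (Matches(line, ids, idColumn))
                    writer.WriteLine(line);
            }
        }
    }
}
```

Unlike Contains on the whole line, this matches the field exactly, so an ID of 42 no longer selects a row whose ID is 420.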

Answer 2

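Just throwing this out there: it isn't particularly related to any of the tags on your question, but the *nix "grep -f" feature would work here. Essentially, you have a file containing the list of strings you want to match (e.g. StringsToFind.txt) and your CSV input file (e.g. input.csv), and the following command writes the matching lines to output.csv: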

grep -f StringsToFind.txt input.csv > output.csv

See the grep man page for details.

Select rows from a CSV based on a list of IDs
