Question

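I want to edit a text file so that every line appears only once. Each line contains 10 characters, and I usually have 5-6 million lines, so the code I currently use consumes too much memory.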

My code:

File.WriteAllLines(targetpath, File.ReadAllLines(sourcepath).Distinct());

So how can I make it use less RAM and, at the same time, take less time?

Answer 1

Taking into account how much memory a string will take in C#, and assuming 6,000,000 records of 10 characters each, we get:

Size of one string in bytes ~= 20 + (length / 2) * 4
Total size in bytes ~= (20 + (10 / 2) * 4) * 6,000,000 = 240,000,000
Total size in MB ~= 230
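
The arithmetic above can be reproduced with a few lines of C#. This is only a sketch of the rule of thumb quoted in this answer; the class and method names are made up for illustration, and the real per-string overhead depends on the runtime and on whether the process is 32-bit or 64-bit.

```csharp
public static class StringMemoryEstimate
{
    // Rule of thumb from the answer: 20 + (length / 2) * 4 bytes per string.
    // Integer division is used, so this is only an approximation for odd lengths.
    public static long BytesPerString(int length)
    {
        return 20 + (length / 2) * 4;
    }

    // Estimated total for `count` strings of the same length.
    public static long TotalBytes(int length, long count)
    {
        return BytesPerString(length) * count;
    }
}
```

For example, StringMemoryEstimate.TotalBytes(10, 6000000) gives 240000000 bytes, which is about 229 MB when divided by 1024 * 1024.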

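Now, 230 MB is not really a problem even on x86 (a 32-bit system), so you can load all the data into memory. To do that I would use the HashSet class, which is, as the name says, a hash set, and which makes it easy to eliminate duplicates by doing a lookup before adding an element.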

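In terms of big-O notation for time complexity, a lookup in a hash set takes O(1) on average, which is the best you can get. In total you perform N lookups, which gives N * O(1) = O(N).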

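In terms of big-O notation for space complexity, you will use O(N) space, meaning the memory you use is proportional to the number of elements, which is also the best you can get.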

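If you implement the algorithm in C# and do not rely on any external component (which would also use at least O(N)), I am not sure it is possible to use less space.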

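That said, you can optimize for some scenarios by reading your file line by line, see here. If you have many duplicates this gives better results, but in the worst case, where all lines are distinct, it consumes the same amount of memory.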
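
A minimal sketch of the line-by-line approach, assuming the source and target are different files. The class and method names are hypothetical; File.ReadLines, HashSet<string>.Add and StreamWriter are standard .NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineDeduplicator
{
    // Copies sourcePath to targetPath, keeping only the first occurrence of each line.
    // Returns the number of lines written. Assumes sourcePath != targetPath.
    public static int WriteDistinctLines(string sourcePath, string targetPath)
    {
        // Only the distinct lines are kept in memory, not the whole file.
        var seen = new HashSet<string>(StringComparer.Ordinal);
        int written = 0;

        using (var writer = new StreamWriter(targetPath))
        {
            // File.ReadLines enumerates lazily, unlike File.ReadAllLines,
            // which builds the full string[] before returning.
            foreach (string line in File.ReadLines(sourcePath))
            {
                // Add returns false when the line is already in the set.
                if (seen.Add(line))
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }

        return written;
    }
}
```

Calling LineDeduplicator.WriteDistinctLines(sourcepath, targetpath) avoids holding both the full array and the set at the same time; with few duplicates the set still grows to roughly the size of the file.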

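As a final note, if you look at how the Distinct method is implemented, you will see that it also uses a hash table implementation. It is not the same class, but the performance is still roughly the same; check this question for more details.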
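
Because Distinct is itself hash-based and evaluated lazily, the smallest change to the code in the question is probably to swap File.ReadAllLines for File.ReadLines, so the input is streamed instead of loaded into an array first. The wrapper class below is hypothetical and only there to make the sketch compile; it assumes the source and target paths differ, since the source is still being read while the target is written.

```csharp
using System.IO;
using System.Linq;

public static class DistinctOneLiner
{
    // Streams the source file through Distinct and writes the result.
    // Distinct still remembers every distinct line it has seen, so memory
    // use remains O(number of distinct lines).
    public static void Run(string sourcePath, string targetPath)
    {
        File.WriteAllLines(targetPath, File.ReadLines(sourcePath).Distinct());
    }
}
```

Note that the documentation describes the result of Distinct as unordered, even though current implementations yield items in first-seen order; if output order matters, an explicit HashSet loop makes that guarantee visible in your own code.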

Answer 2

As ironstone13 corrected me, HashSet is fine, but it does store the data. In that case this works well too:

        string[] arr = File.ReadAllLines("file.txt");
        HashSet<string> hashes = new HashSet<string>();

        for (int i = 0; i < arr.Length; i++)
        {
            if (!hashes.Add(arr[i])) arr[i] = null;
        }

        File.WriteAllLines("file2.txt", arr.Where(x => x != null));

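The following implementation was motivated by memory use and by hash collisions. The main idea is to keep only the hash values; of course it would then have to go back to the file to fetch the line whenever it sees a hash collision or a duplicate line, in order to tell which of the two it is. (That part is not implemented.)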

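using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Program
{
    // Lines of the input file; duplicates are set to null as they are found.
    static string[] arr;
    // hashes[h] maps a hash code to the index in arr of one line with that hash.
    // Additional dictionaries hold further lines whose hash codes collide.
    static Dictionary<int, int>[] hashes = { new Dictionary<int, int>() };
    // For the hash being examined: the matching line index per dictionary, or -1.
    static int[] file_indexes = {-1};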


    static void AddHash(int hash, int index)
    {
        for (int h = 0; h < hashes.Length; h++)
        {
            Dictionary<int, int> dict = hashes[h];
            if (!dict.ContainsKey(hash))
            {
                dict[hash] = index;
                return;
            }
        }
        // Every existing dictionary already holds this hash code: add another level.
        hashes = hashes.Concat(new[] {new Dictionary<int, int>() {{hash, index}}}).ToArray();
        file_indexes = Enumerable.Range(0, hashes.Length).Select(x => -1).ToArray();
    }

    static int UpdateFileIndexes(int hash)
    {
        int updates = 0;
        for (int h = 0; h < hashes.Length; h++)
        {
            int index;
            if (hashes[h].TryGetValue(hash, out index))
            {
                file_indexes[h] = index;
                updates++;
            }
            else
            {
                file_indexes[h] = -1;
            }
        }
        return updates;
    }

    static bool IsDuplicate(int index)
    {
        string str1 = arr[index];
        for (int h = 0; h < hashes.Length; h++)
        {
            int i = file_indexes[h];
            if (i == -1 || index == i) continue;
            string str0 = arr[i];
            if (str0 == null) continue;
            if (string.CompareOrdinal(str0, str1) == 0) return true;
        }
        return false;
    }


    static void Main(string[] args)
    {
        arr = File.ReadAllLines("file.txt");

        for (int i = 0; i < arr.Length; i++)
        {
            int hash = arr[i].GetHashCode();

            if (UpdateFileIndexes(hash) == 0) AddHash(hash, i);
            else if (IsDuplicate(i)) arr[i] = null;
            else AddHash(hash, i);
        }

        File.WriteAllLines("file2.txt", arr.Where(x => x != null));



        Console.WriteLine("DONE");
        Console.ReadKey();
    }
}

Answer 3

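Before you write your data, if it is in a list or a dictionary, you can run a LINQ query and use group by to group all identical keys together. Then write one entry per group to the output file.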
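
A rough sketch of the group-by idea, assuming the lines are already available as an in-memory sequence. The class and method names are hypothetical; GroupBy, Select and File.WriteAllLines are standard APIs.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class GroupByDedup
{
    // Groups identical lines and writes one representative (the group key) per group.
    public static void WriteUnique(IEnumerable<string> lines, string targetPath)
    {
        IEnumerable<string> unique = lines
            .GroupBy(line => line)        // one group per distinct line
            .Select(group => group.Key);  // keep a single copy of each

        File.WriteAllLines(targetPath, unique);
    }
}
```

GroupBy has to buffer every element before it can yield the first group, so this is likely to use more memory than Distinct or a HashSet; it is mostly useful when you also need something per group, such as a count of occurrences.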

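Your question is also a bit vague. Are you creating a new text file each time, and does the data have to be stored as text? There are better formats available, such as XML and JSON.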

How can I efficiently remove duplicate lines from a large text file?
