Question

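I have a text file containing a 1122 x 1122 matrix of precipitation measurements. Each measurement is written with 4 decimal digits. An example line looks like this: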

0.0234 0.0023 0.0123 0.3223 0.1234 0.0032 0.1236 0.0000 ....

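(Each line holds 1122 values, and there are 1122 lines.)

I need the same text file, but with every value divided by 6. (And I have to do this for 920 such files...)

I managed to do this, but in what is no doubt an atrociously inefficient and memory-hungry way: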

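I open the text files one by one and read each file line by line.
I split each line into a string array, with the individual values as members.
I go through the array, convert each value to double, divide it by 6, convert the result back to a string formatted with 4 decimal digits, and store it as a member of a new string array.
I join the array into one line.
I write this line to a new text file.
Voila (after an hour or so...) I have my 920 new text files.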

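I am sure there is a much faster and more professional way to do this. I have looked at endless sites about Matrix.Divide, but I don't see (or understand) a solution there for this problem. Any help will be appreciated! This is the code snippet used for each file: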



    foreach (string inputline in inputfile)
    {
        int count = 0;
        string[] str_precip = inputline.Split(' ');  // holds string measurements
        string[] str_divided_precip = new string[str_precip.Length]; // will hold string measurements divided by divider (6)
        foreach (string measurements in str_precip)
        {
            str_divided_precip[count] = ((Convert.ToDouble(measurements)) / 6).ToString("F4", CultureInfo.CreateSpecificCulture("en-US"));
            count++;
        }
        string divline = string.Join(" ", str_divided_precip);
        using (System.IO.StreamWriter newfile = new System.IO.StreamWriter(@"asc_files\divfile.txt", true))
        {
            newfile.WriteLine(divline);
        }
    }

Answer 1

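Assuming the files are well-formed, you should basically be able to process them one character at a time, without creating any arrays or doing any complicated string parsing.

This snippet shows the general approach: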

string s = "12.4567 0.1234\n"; // just an example
decimal d = 0;
foreach (char c in s)
{
    if (char.IsDigit(c))
    {
        d *= 10;
        d += c - '0';
    }
    else if (c == ' ' || c == '\n')
    {
        d /= 60000; // divide by 10000 to get 4dps; divide by 6 here too
        Console.Write(d.ToString("F4"));
        Console.Write(c);
        d = 0;
    }
    else {
        // no special processing needed as long as input file always has 4dp
        Debug.Assert(c == '.');
    }
}

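Clearly you would write to a (buffered) file stream instead of the Console.

You could roll your own faster version of ToString("F4"), but I doubt it would make a significant difference to the timings. If, however, this approach lets you avoid creating a new array for every line of the input file, I would expect that to make a substantial difference. (By contrast, one array per file serving as the buffer for a buffered writer is worthwhile, especially if it is allocated large enough from the start.)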
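
As a rough illustration of the hand-rolled ToString("F4") idea, here is a minimal sketch. It assumes non-negative values held as integers scaled by 10000 (so 0.0234 is 234), as in the snippet above. The names FixedPointSketch and WriteDividedBy6 are made up for illustration, and the rounding is plain round-half-up done in integer arithmetic. Whether it actually beats ToString("F4") would have to be measured.

```csharp
using System;
using System.IO;

public static class FixedPointSketch
{
    // Hypothetical helper: 'scaled' is a non-negative value times 10000 (0.0234 -> 234).
    // Writes scaled / 6 with exactly 4 decimal places, without floating point.
    public static void WriteDividedBy6(TextWriter writer, int scaled)
    {
        int q = (scaled + 3) / 6;   // integer division by 6, rounding half up
        int whole = q / 10000;      // digits before the decimal point
        int frac = q % 10000;       // the 4 digits after the decimal point

        writer.Write(whole);
        writer.Write('.');
        writer.Write((char)('0' + frac / 1000));
        writer.Write((char)('0' + frac / 100 % 10));
        writer.Write((char)('0' + frac / 10 % 10));
        writer.Write((char)('0' + frac % 10));
    }
}
```

In the loop above, the accumulator would then be an int rather than a decimal, and the `d /= 60000` plus ToString("F4") pair would become a single call such as `FixedPointSketch.WriteDividedBy6(writer, d)`.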

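Edit (by Sani Singh Huttunen)
Sorry for editing your post, but you are absolutely right about this.
Fixed-point arithmetic provides a significant improvement in this case.

After introducing StreamReader (~10% improvement), float (another ~35% improvement) and other improvements (yet another ~20% improvement; see comments), this approach takes ~12 minutes (system specs are in my answer):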

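// Requires: using System.Globalization; using System.IO;
public void DivideMatrixByScalarFixedPoint(string inputFilename, string outputFilename)
{
    using (var inFile = new StreamReader(inputFilename))
    using (var outFile = new StreamWriter(outputFilename))
    {
        var d = 0;            // digits of the current value read as an integer (0.0234 -> 234)
        var pending = false;  // true while a value has been read but not yet written

        while (!inFile.EndOfStream)
        {
            var c = (char) inFile.Read();
            if (c >= '0' && c <= '9')
            {
                d = (d * 10) + (c - '0');
                pending = true;
            }
            else if (c == ' ' || c == '\n')
            {
                // divide by 10000 to get 4dps; divide by 6 here too
                outFile.Write((d / 60000f).ToString("F4", CultureInfo.InvariantCulture.NumberFormat));
                outFile.Write(c);
                d = 0;
                pending = false;
            }
            // any other character ('.' or '\r') is skipped
        }

        // write the last value if the file does not end with a newline
        if (pending)
        {
            outFile.Write((d / 60000f).ToString("F4", CultureInfo.InvariantCulture.NumberFormat));
        }
    }
}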
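
Since the question involves 920 files, a small driver loop may be useful. The following is only a sketch: BatchSketch and ProcessAll are hypothetical names, the per-file work is passed in as a delegate so any of the approaches in this thread can be plugged in, and it assumes the output directory is different from the input directory (otherwise a file would be read and overwritten at the same time).

```csharp
using System;
using System.IO;

public static class BatchSketch
{
    // Hypothetical helper: calls divide(inputPath, outputPath) for every file
    // directly inside inputDir and returns the number of files processed.
    public static int ProcessAll(string inputDir, string outputDir, Action<string, string> divide)
    {
        Directory.CreateDirectory(outputDir);   // does nothing if it already exists
        int count = 0;
        foreach (string inputPath in Directory.EnumerateFiles(inputDir))
        {
            // keep the file name, change only the directory
            string outputPath = Path.Combine(outputDir, Path.GetFileName(inputPath));
            divide(inputPath, outputPath);
            count++;
        }
        return count;
    }
}
```

From inside the class that defines the method above, a call could look like `BatchSketch.ProcessAll(@"asc_files", @"asc_files_divided", DivideMatrixByScalarFixedPoint);`, where the output directory name is made up for this example.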

Answer 2

You open and close the output file for every line; I think we can do better! Open it once, outside the loop, by replacing your code with this:

using (System.IO.StreamWriter newfile = new System.IO.StreamWriter(@"asc_files\divfile.txt", true))
{
    foreach (string inputline in inputfile)
    {
        int count = 0;
        foreach (string measurements in inputline.Split(' '))
        {
            newfile.Write((Convert.ToDouble(measurements) / 6).ToString("F4", CultureInfo.CreateSpecificCulture("en-US")));
            if (++count < 1122)
            {
                newfile.Write(" ");
            }
        }

        newfile.WriteLine();
    }
}

For the reading part, you may want to read one line at a time with ReadLine() instead of reading the whole file as one huge block and then splitting it in memory. This streaming approach greatly reduces memory allocation and, depending on your hardware (how much memory you have, and whether your disks are HDDs or SSDs), may noticeably improve performance.
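
A minimal sketch of that streaming idea, assuming space-separated values and invariant-culture number formatting. StreamingDivider, DivideLines and DivideFile are hypothetical names chosen for this example. The reader and writer are each opened once per file, and only one line is held in memory at a time.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class StreamingDivider
{
    // Hypothetical helper: reads line by line from 'reader', divides every
    // value by 'divisor' and writes it with 4 decimal places to 'writer'.
    public static void DivideLines(TextReader reader, TextWriter writer, double divisor)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < parts.Length; i++)
            {
                if (i > 0) writer.Write(' ');   // separator between values, none at line end
                double value = double.Parse(parts[i], CultureInfo.InvariantCulture);
                writer.Write((value / divisor).ToString("F4", CultureInfo.InvariantCulture));
            }
            writer.WriteLine();
        }
    }

    // Opens each file exactly once and streams through it.
    public static void DivideFile(string inputPath, string outputPath, double divisor)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            DivideLines(reader, writer, divisor);
        }
    }
}
```

Unlike the hard-coded 1122 in the loop above, this version takes the number of values per line from the line itself.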

Please let me know how it performs now; I'm very curious!

Answer 3

Math.NET Numerics is useful for this kind of operation. It should be fast and have a small memory footprint.

using System.Globalization;
using System.IO;
using MathNet.Numerics.Data.Text;
using MathNet.Numerics.LinearAlgebra;

public void DivideMatrixByScalar(string inputFilename, string outputFilename, double scalar)
{
    Matrix<double> matrix;

    using (var sr = new StreamReader(inputFilename))
    {
        matrix = DelimitedReader.Read<double>(sr, false, "\\s", false, CultureInfo.InvariantCulture.NumberFormat);
    }

    // Divide all values with the scalar.
    matrix = matrix.Divide(scalar);

    using (var sw = new StreamWriter(outputFilename))
    {
        DelimitedWriter.Write(sw, matrix, " ", null, "0.0000", CultureInfo.InvariantCulture.NumberFormat);
    }
}

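Update
Time taken for 920 files of 1122x1122 double values: ~43 minutes
Memory footprint: max 129 MB, avg 59 MB
CPU usage: max 20%, avg 18%

The conclusion is that this is very I/O heavy, and that is where most of the time goes. An SSD, or better yet SSDs in RAID, would speed this up.

System specs
HDD: WD20EARS 5400 RPM
24GB DDR3 @ 2133 MHz
Intel Core i7 950 @ 3.07 GHz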

C# - text file with matrix - divide all entries