Question

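I need to read a large amount of binary data from a file. I have a fixed record size (38) and would like to skip several records at a time. I have tried doing this with FileStream's Position or Seek, but that also seems to take a long time. So even if I skip 10 records, I do not get through the file 10 times faster.

Here is an SSCCE.

Note to moderators: this is not a duplicate question. It is a follow-up that I extracted from another question so that it could explore a different focus.

You will need to create 2 buttons, Serialize and Deserialize.

Serialize creates a dummy data file.

Deserialize reads it.

Comment out the fs.Position line to see a raw read of the entire file. It takes 12 seconds on my machine. Then uncomment it, and the read will skip 10 records each time. I was hoping for a 10x speed improvement, but it takes 8 seconds on my machine. So I am assuming that changing fs.Position is expensive.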

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using ProtoBuf;
using System.IO;
using System.Diagnostics;

namespace BinTest3
{


    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Serialize_Click(object sender, EventArgs e)
        {

            FileStream outBin = null;

            string binFileName = @"C:\binfile.dft";
            outBin = File.Create(binFileName, 2048, FileOptions.None);

            DateTime d = DateTime.Now;

            TickRecord tr = new TickRecord(d, 1.02, 1.03,200,300);

            for (int i =0; i < 20000000; i++)
            {
                tr.BidPrice += 1;
                Serializer.SerializeWithLengthPrefix(outBin, tr, PrefixStyle.Base128);
            }

            outBin.Close();
            label1.Text = "Done ";
        }

        private void Deserialize_Click(object sender, EventArgs e)
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();

            FileStream fs;
            string binFileName = @"C:\binfile.dft";

            fs = new FileStream(binFileName, FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 4096);
            long skipRate =10;
            int count = 0;
            TickRecord tr;

            long skip = (38*skipRate);
            try
            {
                while ((tr = Serializer.DeserializeWithLengthPrefix<TickRecord>(fs, PrefixStyle.Base128)) != null) //fs.Length > fs.Position)
                {
                    count++;

                    fs.Position += skip;  //Comment out this line to see raw speed

                }
            }
            catch (Exception)
            {
                // Exceptions are swallowed here, so a failed read just ends the loop silently.
            }

            fs.Close();

            sw.Stop();
            label1.Text = "Time taken: " + sw.Elapsed + " Count: " + count.ToString("n0");

        }
    }


    [ProtoContract]
    public class TickRecord
    {

        [ProtoMember(1, DataFormat = DataFormat.FixedSize)]
        public DateTime DT;
        [ProtoMember(2)]
        public double BidPrice;
        [ProtoMember(3)]
        public double AskPrice;
        [ProtoMember(4, DataFormat = DataFormat.FixedSize)]
        public int BidSize;
        [ProtoMember(5, DataFormat = DataFormat.FixedSize)]
        public int AskSize;

        public TickRecord()
        {

        }

        public TickRecord(DateTime DT, double BidPrice, double AskPrice, int BidSize, int AskSize)
        {
            this.DT = DT;
            this.BidPrice = BidPrice;
            this.AskPrice = AskPrice;
            this.BidSize = BidSize;
            this.AskSize = AskSize;

        }



    }
}

Answer 1

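The disk is no faster at reading a single byte than it is at reading two bytes. The disk has to read in large chunks at a time, so skipping over a small number of records is not really going to change the performance. You pay a fixed price to read some minimum amount of data, and that size varies from disk to disk.

More importantly, there is a significant overhead to each call into the file API. If you only read a small amount at a time, you pay that cost again and again. You would be better off implementing buffering in your own code: read large chunks of data into memory, then satisfy the actual reads from memory. Probably the most efficient way is to use a memory mapped file.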
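To make the chunk argument concrete, here is a rough sketch that counts how many fixed-size blocks a scan touches, with and without skipping. The 4096-byte block size and the BlocksTouched helper are assumptions for illustration only. Real disks and caches use their own sizes.

```csharp
using System;

static class BlockCount
{
    // Counts the distinct blocks touched when reading one recordSize-byte record
    // and then skipping skipRate records, repeated until the end of the file.
    // blockSize is an assumed minimum read unit, not a measured value.
    public static long BlocksTouched(long fileSize, int recordSize, int skipRate, int blockSize)
    {
        long stride = (long)recordSize * (skipRate + 1);
        long touched = 0;
        long lastBlock = -1; // highest block index counted so far

        for (long pos = 0; pos + recordSize <= fileSize; pos += stride)
        {
            long first = pos / blockSize;
            long last = (pos + recordSize - 1) / blockSize;
            if (last > lastBlock)
            {
                // Only count blocks that have not been counted already.
                touched += last - Math.Max(first, lastBlock + 1) + 1;
                lastBlock = last;
            }
        }
        return touched;
    }
}
```

For a 38,000,000-byte file of 38-byte records, BlocksTouched(38000000, 38, 0, 4096) and BlocksTouched(38000000, 38, 10, 4096) return the same count. The stride of 418 bytes is far smaller than one block, so every block still has to be read.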
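A minimal sketch of the buffering idea follows. It assumes the file holds raw records of exactly recordSize bytes each (including any length prefix), which may not match the protobuf-net layout in the question. ChunkedSkipReader, ReadSampled and the onRecord callback are hypothetical names, not part of any library.

```csharp
using System;
using System.IO;

static class ChunkedSkipReader
{
    // Reads the file in large chunks and visits every (skipRate + 1)-th record in memory.
    // onRecord receives the chunk buffer and the record's offset inside that buffer.
    // Returns the number of records visited.
    public static long ReadSampled(string path, int recordSize, int skipRate,
                                   int recordsPerChunk, Action<byte[], int> onRecord)
    {
        int chunkBytes = recordSize * recordsPerChunk; // whole records only, so none straddles a chunk
        byte[] buffer = new byte[chunkBytes];
        long stride = (long)recordSize * (skipRate + 1);
        long nextWanted = 0;  // absolute file offset of the next record to visit
        long chunkStart = 0;  // absolute file offset of buffer[0]
        long count = 0;

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.SequentialScan))
        {
            while (true)
            {
                // Fill the buffer; a single Read call may return fewer bytes than requested.
                int filled = 0;
                int n;
                while (filled < chunkBytes &&
                       (n = fs.Read(buffer, filled, chunkBytes - filled)) > 0)
                {
                    filled += n;
                }
                if (filled == 0) break;

                // Serve every wanted record that lies completely inside this chunk.
                while (nextWanted + recordSize <= chunkStart + filled)
                {
                    onRecord(buffer, (int)(nextWanted - chunkStart));
                    count++;
                    nextWanted += stride;
                }

                chunkStart += filled;
                if (filled < chunkBytes) break; // short chunk means end of file
            }
        }
        return count;
    }
}
```

With this shape the file API is called once per chunk instead of once per record, and skipping becomes plain index arithmetic. The memory mapped route mentioned above would use System.IO.MemoryMappedFiles.MemoryMappedFile in place of the manual buffer. Whether either is faster on a given machine would need measuring.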

Is there a faster way than FileStream.Position to skip over parts of a binary file?
