Reading from a binary file into an array: an arbitrary number is printed first

Time: 2017-05-15 02:40:12

Tags: c++ arrays ifstream

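I am trying to read from a binary file into a char array. When I print an array entry, an arbitrary number followed by a newline is printed before the number I want. I really cannot make sense of this. The first few bytes of the file are: 00 00 08 03 00 00 EA 60 00 00 00 1C 00 00 00 1C 00 00

My code: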

  void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                          std::string labelsPath) {
  std::ifstream is(imagesPath.c_str());
  char *data = new char[12];

  is.read(data, 12);

  std::cout << std::hex  << (int)data[2] << std::endl;

  delete [] data;
  is.close();
}

This prints:

ffffff9b
8

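The 8 is correct. The preceding number changes from run to run. And where does this newline come from?

1 answer:

Answer 0 (score: 1)

You asked how to read data from a binary file and store it in a char[], and you showed us the code you submitted with your question: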

  void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                          std::string labelsPath) {
  std::ifstream is(imagesPath.c_str());
  char *data = new char[12];

  is.read(data, 12);

  std::cout << std::hex  << (int)data[2] << std::endl;

  delete [] data;
  is.close();
}

And you want to know:

  The preceding number changes from run to run. And where does this newline come from?

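Before that question can really be answered, you need to know about the binary file, that is, how the file is structured internally. When you read data from a binary file, keep in mind that some program wrote that data to the file, and that it was written in a structured format. That format is unique to each family or type of binary file. Most binaries follow a common pattern: they contain a header, possibly sub headers, then clusters, packets or chunks, or raw data after the headers, while some binaries are purely raw data. You have to know how the file is laid out in memory.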

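  • What is the structure of the data?
    • What is the type of the first data written to the file: char = 1 byte, int = usually 4 bytes (the exact size is implementation-defined), float = 4 bytes, double = 8 bytes, etc.?

Based on your code, you have a char array of size 12, and since a char takes 1 byte in memory, you are asking for 12 bytes. The problem is that you now have 12 consecutive individual bytes, and without knowing the file structure you cannot tell whether the first byte was actually written as a char, as an unsigned char, or as part of an int.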
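The signedness of those bytes also matters when printing them. On platforms where plain `char` is signed, a byte of 0x80 or above is negative, and converting it to `int` fills the upper bits with ones, which is how a value shaped like `ffffff9b` can appear. The following is a minimal sketch of that effect; `byteToHex` and `byteToHexNaive` are hypothetical helper names, not part of the original code.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: format one byte as hex without sign extension.
std::string byteToHex(char byte) {
    std::ostringstream out;
    // Going through unsigned char keeps the value in the range 0..255.
    out << std::hex << static_cast<int>(static_cast<unsigned char>(byte));
    return out.str();
}

// For comparison: convert the plain char straight to int, as the question does.
// Where char is signed, bytes of 0x80 and above come out sign-extended.
std::string byteToHexNaive(char byte) {
    std::ostringstream out;
    out << std::hex << static_cast<int>(byte);
    return out.str();
}
```

Both helpers agree for small values such as 8; they differ only for bytes of 0x80 and above, and only where `char` is signed.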

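Consider these two different binary file structures, created from C++ structs that contain all of the needed data, both written to a file in binary format.

A common header structure that both file structures will use: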

struct Header {
    // Size of Header
    std::string filepath;
    std::string filename;

    unsigned int pathSize;
    unsigned int filenameSize;

    unsigned int headerSize;
    unsigned int dataSizeInBytes;
};

FileA: the unique structure for File A

struct DataA {
    float width;
    float length;
    float height;
    float dummy;
};

FileB: the unique structure for File B

struct DataB {
    double length;
    double width;
};

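A file in memory would typically look like this:

  • The first few bytes are the path and file name, plus the stored sizes
    • This can vary from file to file, depending on how many characters are used for the file path and file name.
    • After the strings we know that the next 4 values are unsigned ints, so with the usual 4-byte unsigned int that is 4 bytes x 4 = 16 bytes in total.
    • The size of unsigned int is implementation-defined, so a platform with 8-byte ints would give 8 bytes x 4 = 32 bytes in total.
    • If we know the architecture of the system that wrote the file, we can work this out easily.
    • Of these 4 unsigned values, the first two are the lengths of the path and the file name. These could be the first two values read from the file, ahead of the actual path strings; the order could be reversed.
    • The next 2 unsigned values are the important ones.
    • The first of them is the full size of the header, which can be used to read in and skip over the header.
    • The second tells you the size of the data to pull in. This could be expressed as chunks with a count of how many chunks there are, because the data could be a series of identical structures, but for simplicity I left out chunks and counts and use a single instance of the structure.
    • This is where we learn how many bytes of data to extract.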
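The layout described above can be sketched as a length-prefixed header. This is only an illustration under stated assumptions: the field order (two lengths, the data size, then the two strings), the use of fixed-width `std::uint32_t` in native byte order, and the names `FileHeader`, `writeHeader` and `readHeader` are all hypothetical and not taken from any real file format.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>   // only needed for the in-memory usage described below
#include <string>

// Hypothetical on-disk layout (an assumption for illustration):
//   uint32 pathSize, uint32 filenameSize, uint32 dataSizeInBytes,
//   then pathSize bytes of path, then filenameSize bytes of file name.
struct FileHeader {
    std::string filepath;
    std::string filename;
    std::uint32_t dataSizeInBytes = 0;
};

// Write one 32-bit value as raw bytes in native byte order.
void writeU32(std::ostream& out, std::uint32_t value) {
    out.write(reinterpret_cast<const char*>(&value), sizeof(value));
}

// Read one 32-bit value as raw bytes in native byte order.
std::uint32_t readU32(std::istream& in) {
    std::uint32_t value = 0;
    in.read(reinterpret_cast<char*>(&value), sizeof(value));
    return value;
}

void writeHeader(std::ostream& out, const FileHeader& h) {
    writeU32(out, static_cast<std::uint32_t>(h.filepath.size()));
    writeU32(out, static_cast<std::uint32_t>(h.filename.size()));
    writeU32(out, h.dataSizeInBytes);
    out.write(h.filepath.data(), static_cast<std::streamsize>(h.filepath.size()));
    out.write(h.filename.data(), static_cast<std::streamsize>(h.filename.size()));
}

FileHeader readHeader(std::istream& in) {
    FileHeader h;
    // The lengths come first, so the reader knows how many bytes each string has.
    const std::uint32_t pathSize = readU32(in);
    const std::uint32_t nameSize = readU32(in);
    h.dataSizeInBytes = readU32(in);
    h.filepath.resize(pathSize);
    h.filename.resize(nameSize);
    in.read(&h.filepath[0], pathSize);
    in.read(&h.filename[0], nameSize);
    return h;
}
```

Writing a header into a binary `std::stringstream` and reading it back returns the same strings and data size. Using a fixed-width type sidesteps the question of how large `unsigned int` is on the machine that wrote the file.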

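Let's consider the two different binary files, where we are already past all of the header information and are reading in the bytes to parse. We have the size of the data in bytes: for FileA that is 4 floats = 16 bytes, and for FileB it is 2 doubles = 16 bytes. So now we know how to call a method that reads in y values of data type x: once the type x (int, float, double, char, etc.) and the count y are known, the read covers y * sizeof(x) bytes.

Now let's say we are reading either of these two files, but we do not know the data structure or how the information was stored in the file. From the header we can see that the data is 16 bytes in size, but we do not know whether it was stored as 4 floats = 16 bytes or as 2 doubles = 16 bytes. Both structures are 16 bytes, but they hold a different number of values of different types.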
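To make the ambiguity concrete, here is a small sketch that builds the 16 bytes FileA would hold and then reads them back under both interpretations. The names `readAs` and `makeFileABytes` are hypothetical, and the sketch assumes the common sizes of 4 bytes for `float` and 8 bytes for `double`, which the `static_assert` lines check.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");

// Interpret a raw byte buffer as `count` values of type T.
template <typename T>
std::vector<T> readAs(const unsigned char* bytes, std::size_t count) {
    std::vector<T> values(count);
    if (count > 0) {
        // memcpy is the well-defined way to reinterpret raw bytes as another type.
        std::memcpy(values.data(), bytes, count * sizeof(T));
    }
    return values;
}

// Build the 16 bytes that FileA would hold: four floats.
std::vector<unsigned char> makeFileABytes() {
    const float data[4] = {3.0f, 3.0f, 3.0f, 0.0f};
    std::vector<unsigned char> bytes(sizeof(data));
    std::memcpy(bytes.data(), data, sizeof(data));
    return bytes;
}
```

Calling `readAs<float>(bytes.data(), 4)` on the result of `makeFileABytes()` recovers the original floats, while `readAs<double>(bytes.data(), 2)` on the same 16 bytes produces unrelated values. Nothing in the bytes themselves says which reading is the right one.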

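The sum of it all is that without knowing the file's data structure and how to parse the binary, this becomes an X/Y Problem.

Now let's assume you do know the file structure. To try to answer your question from above, you can try this little program and look at some of the results: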

#include <string>
#include <iostream>

int main() {

    // Using Two Strings
    std::string imagesPath("ImagesPath\\");
    std::string labelsPath("LabelsPath\\");

    // Concat of Two Strings
    std::string full = imagesPath + labelsPath;

    // Display Of Both
    std::cout << full << std::endl;

    // Data Type Pointers 
    char* cData = nullptr;
    cData = new char[12];

    unsigned char* ucData = nullptr;
    ucData = new unsigned char[12];

    // Loop To Set Both Pointers To The String
    unsigned n = 0;
    for (; n < 12; ++n) {
        cData[n] = full.at(n);
        ucData[n] = full.at(n);
    }

    // Display Of Both Strings By Character and Unsigned Character
    n = 0;
    for (; n < 12; ++n) {
        std::cout << cData[n];
    }
    std::cout << std::endl;

    n = 0;
    for (; n < 12; ++n) {
        std::cout << ucData[n];
    }
    std::cout << std::endl;
    // Both Yield The Same Result
    // Okay, let's clear out the memory of these pointers and then reuse them.

    delete[] cData;
    delete[] ucData;
    cData = nullptr;
    ucData = nullptr;

    // Create Two Data Structures, 1 For Each Different File
    struct A {
        float length;
        float width;
        float height;
        float padding;
    };

    struct B {
        double length;
        double width;
    };

    // Constants For Our Data Structure Sizes
    const unsigned sizeOfA = sizeof(A);
    const unsigned sizeOfB = sizeof(B);

    // Create And Populate An Instance Of Each
    A a;
    a.length = 3.0f;
    a.width = 3.0f;
    a.height = 3.0f;
    a.padding = 0.0f;

    B b;
    b.length = 5.0;
    b.width = 5.0;

    // Let's first use the `char[]` method for each struct and print them,
    // but we need 16 bytes instead of the `12` from your problem
    char *aData = new char[sizeOfA];  // FileA
    char *bData = new char[sizeOfB];  // FileB

    // Copy the raw bytes of each struct into its buffer, one byte at a time.
    // Assigning a float to a single char (aData[0] = a.length) would only
    // store a truncated value and leave the other bytes uninitialized.
    const char *aBytes = reinterpret_cast<const char*>(&a);
    const char *bBytes = reinterpret_cast<const char*>(&b);
    for (n = 0; n < sizeOfA; ++n) {
        aData[n] = aBytes[n];
    }
    for (n = 0; n < sizeOfB; ++n) {
        bData[n] = bBytes[n];
    }

    // Print the individual bytes of A as hex values, without interpreting them
    std::cout << std::hex;
    for (n = 0; n < sizeOfA; ++n) {
        std::cout << static_cast<int>(static_cast<unsigned char>(aData[n])) << " ";
    }
    std::cout << std::endl;

    // Print the individual bytes of B the same way and compare the patterns
    for (n = 0; n < sizeOfB; ++n) {
        std::cout << static_cast<int>(static_cast<unsigned char>(bData[n])) << " ";
    }
    std::cout << std::dec << std::endl;

    // Let's print out both again, rebuilding values of their appropriate types
    for (n = 0; n < 4; ++n) {
        float f;
        char *dst = reinterpret_cast<char*>(&f);
        for (unsigned k = 0; k < sizeof(float); ++k) {
            dst[k] = aData[n * sizeof(float) + k];
        }
        std::cout << f << " ";
    }
    std::cout << std::endl;

    for (n = 0; n < 2; ++n) {
        double d;
        char *dst = reinterpret_cast<char*>(&d);
        for (unsigned k = 0; k < sizeof(double); ++k) {
            dst[k] = bData[n * sizeof(double) + k];
        }
        std::cout << d << " ";
    }
    std::cout << std::endl;

    // Clean Up Memory
    delete[] aData;
    delete[] bData;
    aData = nullptr;
    bData = nullptr;

    // Even By Knowing The Appropriate Sizes We Can See A Difference
    // In The Stored Data Types. We Can Now Do The Same As Above
    // But With Unsigned Char & See If It Makes A Difference.

    unsigned char *ucAData = new unsigned char[sizeOfA];
    unsigned char *ucBData = new unsigned char[sizeOfB];

    // Copy the raw bytes of each struct into its buffer, one byte at a time
    const unsigned char *ucABytes = reinterpret_cast<const unsigned char*>(&a);
    const unsigned char *ucBBytes = reinterpret_cast<const unsigned char*>(&b);
    for (n = 0; n < sizeOfA; ++n) {
        ucAData[n] = ucABytes[n];
    }
    for (n = 0; n < sizeOfB; ++n) {
        ucBData[n] = ucBBytes[n];
    }

    // Print the individual bytes of A as hex values, without interpreting them
    std::cout << std::hex;
    for (n = 0; n < sizeOfA; ++n) {
        std::cout << static_cast<unsigned>(ucAData[n]) << " ";
    }
    std::cout << std::endl;

    // Print the individual bytes of B the same way and compare the patterns
    for (n = 0; n < sizeOfB; ++n) {
        std::cout << static_cast<unsigned>(ucBData[n]) << " ";
    }
    std::cout << std::dec << std::endl;

    // Let's print out both again, rebuilding values of their appropriate types
    for (n = 0; n < 4; ++n) {
        float f;
        unsigned char *dst = reinterpret_cast<unsigned char*>(&f);
        for (unsigned k = 0; k < sizeof(float); ++k) {
            dst[k] = ucAData[n * sizeof(float) + k];
        }
        std::cout << f << " ";
    }
    std::cout << std::endl;

    for (n = 0; n < 2; ++n) {
        double d;
        unsigned char *dst = reinterpret_cast<unsigned char*>(&d);
        for (unsigned k = 0; k < sizeof(double); ++k) {
            dst[k] = ucBData[n * sizeof(double) + k];
        }
        std::cout << d << " ";
    }
    std::cout << std::endl;

    // Clean Up Memory
    delete[] ucAData;
    delete[] ucBData;
    ucAData = nullptr;
    ucBData = nullptr;

    // So changing from `char` to `unsigned char` doesn't change anything here:
    // the bytes of these 2 files are still different from one another.
    // Each has a unique signature. A family of files that a specific application
    // saves in binary form will all follow the same structure. Without knowing the
    // structure of the binary file, that is, `what type` of data you are reading
    // in and how much of it, this becomes an (X/Y) problem.
    // This is the hard part about parsing binaries: you need to know the file structure.

    char c = ' ';
    std::cin.get(c);

    return 0;
}

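After running the short program above, don't worry about what each value displayed on the screen is; just look at the patterns, which are there to compare the two different file structures. This is only to show that a struct of floats that is 16 bytes wide is not the same as a struct of doubles that is also 16 bytes wide. So when we go back to your problem, where you are reading in 12 individual consecutive bytes, the question becomes: what do these 12 bytes represent? Are they 3 ints or 3 unsigned ints (at the usual 4 bytes each), or 3 floats, or some combination such as 1 double and 1 float? What is the actual data structure of the binary file you are reading?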
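As one possible reading of your bytes: the class name `MNISTreader` suggests, although the question does not say so, that this is an MNIST image file in the IDX format, whose header is a series of 32-bit big-endian integers (magic number, image count, rows, columns). Under that assumption, the bytes you quoted can be decoded as in the sketch below. `readBigEndianU32` and `kQuestionBytes` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Combine 4 bytes, most significant byte first, into one 32-bit value.
std::uint32_t readBigEndianU32(const unsigned char* bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8)  |
            static_cast<std::uint32_t>(bytes[3]);
}

// The first 16 bytes quoted in the question.
const unsigned char kQuestionBytes[16] = {
    0x00, 0x00, 0x08, 0x03,   // 0x00000803 = 2051
    0x00, 0x00, 0xEA, 0x60,   // 0x0000EA60 = 60000
    0x00, 0x00, 0x00, 0x1C,   // 0x0000001C = 28
    0x00, 0x00, 0x00, 0x1C    // 0x0000001C = 28
};
```

Read this way, the four fields are 2051, 60000, 28 and 28. Assembling the value from `unsigned char` bytes with shifts gives the same result regardless of the byte order of the machine running the code.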

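Edit: In the small program I wrote, I forgot to try or add […] in the print-out statements. It could be added at each place where the indexed pointer is used, but that is not necessary, because the output to the display is the same: the program only serves to show visually how the two data structures differ in memory and what their patterns look like.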