Question

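I am developing a Linux program that parses a file downloaded from another computer or from the internet and collects information from it. The program must also re-download the file on a schedule (every n days/hours/minutes/whatever) and parse it again to stay up to date in case the file has changed.

However, parsing the file can be resource-intensive, so I would like a function that checks whether the file has changed since it was last downloaded. I imagine something like this: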

int get_checksum(char *filename) {
    // New prototype, if no such function already exists in standard C-libraries
    int result;           // Or char/float/whatever


    // ...


    return result;
}
int main(void) {

    char filename[] = { "foo.dat" };
    char file_url[] = { "http://example.com/foo.dat" };
    int old_checksum;     // Or char/float/whatever
    int new_checksum;     // Or char/float/whatever


    // ...


    // Now assume that old_checksum has a value from before:

    dl_file(filename, file_url);    // Some prototype for downloading the file
    if ((new_checksum = get_checksum(filename)) == -1) {
        // Badness
    }
    else {
        if (new_checksum != old_checksum) {
            old_checksum = new_checksum;
            // Parse the file
        }
        else {
            // Do nothing
        }
    }


    // ...


}

Q1: Is there anything like get_checksum (from the example above) in the standard C/C++ libraries?

Q2: If not, what is the best way to achieve this?

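There is no need for:
- very advanced functionality
- cryptographic or secure checksums
- the ability to compare the new file against files older than the previous one, since the newly downloaded file always overwrites the old one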

Answer 1

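You can use the stat() function. It gives you access to file attributes such as the last access time, last modification time, file size, and so on: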

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};

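Note that you need execute (search) permission on every directory in the path leading to the file; no permission is required on the file itself.

See the stat(2) man page for details.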

Answer 2

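You could do an XOR hash, where you simply XOR together successive blocks of unsigned ints/longs, but this has collision problems. For example, if the file is mostly text, most bytes will fall within the normal ASCII/Unicode character range, so a lot of the key space goes unused.

For a standard implementation, you can read the file into a string and use std::hash from C++11: http://en.cppreference.com/w/cpp/utility/hash

Here is an example of the first approach: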

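#include <cstddef>
#include <cstring>
#include <vector>

unsigned int hash(const std::vector<char> &file) {
    unsigned int result = 0;  // XOR accumulator, starts at zero

    // XOR the data one unsigned int at a time; memcpy avoids unaligned reads.
    for (std::size_t i = 0; i + sizeof(unsigned int) <= file.size(); i += sizeof(unsigned int)) {
        unsigned int block;
        std::memcpy(&block, file.data() + i, sizeof block);
        result ^= block;
    }

    // Trailing bytes that do not fill a whole block are ignored.
    return result;
}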

You just need to read the file into a vector.
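For the std::hash approach mentioned above, a rough sketch might look like the following. The helper read_file and this particular get_checksum signature are assumptions made for illustration. Note that the standard only requires std::hash to give the same result for the same input within a single execution of the program, so this suits a long-running process that keeps the old value in memory, not a checksum saved to disk between runs:

```cpp
#include <cstddef>
#include <fstream>
#include <functional>  // std::hash
#include <iterator>    // std::istreambuf_iterator
#include <string>

// Reads the whole file into `out`; returns false if it cannot be opened.
bool read_file(const std::string &filename, std::string &out) {
    std::ifstream in(filename, std::ios::binary);
    if (!in)
        return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}

// Hypothetical counterpart of get_checksum() from the question:
// hashes the file contents with std::hash<std::string>.
bool get_checksum(const std::string &filename, std::size_t &checksum) {
    std::string contents;
    if (!read_file(filename, contents))
        return false;
    checksum = std::hash<std::string>{}(contents);
    return true;
}
```

Returning a bool and passing the checksum by reference avoids reserving a value such as -1 as an error marker, since any value could be a legitimate hash.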

Answer 3

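There was nothing built into the C++ language for this until std::hash<> arrived in C++11. It is quite simple, but it may suit your needs.

Last I checked, there was nothing in Boost (the most common C++ library extension) either. The reasoning is discussed here, though it may be out of date: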

http://www.gamedev.net/topic/528553-why-doesnt-boost-have-a-cryptographic-hash-library/

So, your best options are:

Use std::hash on the file contents.

Or use something like the following, which can be saved into a simple header and linked:

http://www.zedwood.com/article/cpp-md5-function

Or you could use a library such as OpenSSL or Crypto++.
