Question

我正在寻找一种方法来计算非常大的文件的SHA-1校验和，而不必立即将它们完全加载到内存中。

我不知道SHA-1实现的细节，因此想知道是否有可能这样做。

如果你知道SAX XML解析器，那么我所寻找的东西就是类似的东西：通过一次只将一小部分加载到内存来计算SHA-1校验和。

我发现的所有示例（至少在Java中）总是依赖于将文件/字节数组/字符串完全加载到内存中。

如果你甚至知道实施（任何语言），请告诉我！

Answer 1

Java文档说要使用MessageDigest类来计算任意大小数据的SHA-1。

Answer 2

是的，这完全有可能。 SHA-1是块算法，因此它一次在一个块上运行。确切的块大小会随着您生成的哈希值的大小而变化，但它总是很小（例如，20 - 50字节左右）。当然，这是假设您要包括SHA-1 256,384和/或512（也称为SHA-256，SHA-384和SHA-512）。如果您只包含原始的160位版本，则块大小始终为20个字节（160位）。

编辑：oops - 重新阅读规范，SHA-1，SHA-224，SHA-256的块大小为512位，SHA-384和SHA-512的块大小为1024位。谢谢查尔斯！

Edit2：我几乎忘记了他正在寻找源代码的部分，而不仅仅是建议。这是一些代码。首先是标题：

// Sha.h:
#ifndef SHA_1_H_INCLUDED
#define SHA_1_H_INCLUDED

// This is a relatively straightforward implementation of SHA-1. It makes no particular
// attempt at optimization, instead aiming toward easy verification against the standard.
// To that end, many of the variable names are identical to those used in FIPS 180-2 and
// FIPS 180-3. 
//
// The code should be fairly portable, within a few limitations:
// 1. It requires that 'char' have 8 bits. In theory this is avoidable, but I don't think
// it's worth the bother.
// 2. It only deals with inputs in (8-bit) bytes. In theory, SHA-1 can deal with a number of 
// bits that's not a multiple of 8, but I've never needed it. Since the padding always results
// in a byte-sized stream, the only parts that would need changing would be reading and padding
// the input. The main hashing portion would be unaffected.
//
// Compiles cleanly with:
//    MS VC++ 9.0SP1 (x86 or x64): -W4 -Za
//    gc++ 3.4: -ansi -pedantic -Wall
//    comeau 4.3.3: --vc71 
// Appears to work corectly in all cases. 
// You can't use maximum warnings with Comeau though -- this code itself doesn't give problems
// (that I know of) but Microsoft's headers give it *major* heartburn.
// 
//
// Written by Jerry Coffin, February 2008
// 
// You can use this software any way you want to, with following limitations
// (shamelessly stolen from the Boost software license):
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
// SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
// FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
// ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
// DEALINGS IN THE SOFTWARE.
// 
// If you put this to real use, I'd be happy to hear about it. If you find a bug, 
// I'd be interested in hearing about that too. There's even a pretty good chance 
// that I'll try to fix it, though I certainly can't guarantee that.
// 
#include <algorithm>
#include <vector>
#include <string>
#include <assert.h>
#include <iostream>
#include <sstream>
#include <iomanip>

#if defined(_MSC_VER) && _MSC_VER < 1600
typedef unsigned int uint32_t;
typedef unsigned __int64 uint64_t;
#else
#include <stdint.h>
#endif

namespace crypto { 
namespace {
    struct ternary_operator { 
        virtual uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) = 0;
    };
}

class sha1 { 
    static const size_t hash_size = 5;
    static const size_t min_pad = 64;
    static const size_t block_bits = 512;
    static const size_t block_bytes = block_bits / 8;
    static const size_t block_words = block_bytes / 4;

    std::vector<uint32_t> K;
    std::vector<uint32_t> H;
    std::vector<uint32_t> W;
    std::vector<ternary_operator *> fs;
    uint32_t a, b, c, d, e, T;
    static const size_t block_size = 16;
    static const size_t bytes_per_word = 4;
    size_t total_size;

    // hash a 512-bit block of input.
    //
    void hash_block(std::vector<uint32_t> const &block);

    // Pad the input to a multiple of 512 bits, and add the length
    // in binary to the end.
    static std::string pad(std::string const &input, size_t size);

    // Turn 64 bytes into a block of 16 uint32_t's.
    std::vector<uint32_t> make_block(std::string const &in);

public:

    // Construct a SHA-1 object. More expensive that typical 
    // ctor, but not expected to be copied a lot or anything
    // like that, so it should be fairly harmless.
    sha1();

    // The two ways to provide input for hashing: as a stream or a string.
    // Either way, you get the result as a vector<uint32_t>. It's a fairly
    // small vector, so even if your compiler doesn't do return-value 
    // optimization, the time taken for copying it isn't like to be 
    // significant.
    //
    std::vector<uint32_t> operator()(std::istream &in);
    std::vector<uint32_t> operator()(std::string const &input);

    friend std::ostream &operator<<(std::ostream &os, sha1 const &s);
};
}

#endif

实施：

// Sha1.cpp:
#include "sha.h"
// Please see comments in sha.h for licensing information, etc.
//

// Many people don't like the names I usually use for namespaces, so I've kept this one
// short and simple.
//
namespace crypto {
namespace {  
uint32_t ROTL(uint32_t const &value, unsigned bits) { 
    uint32_t mask = (1 << bits) - 1;
    return value << bits | (value >> (32-bits))&mask;
}

struct f1 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) { 
        return (x & y) ^ (~x&z);
    }
};

struct f2 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) {
        return x ^ y ^ z;
    }
};

struct f3 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) {
        return (x&y) ^ (x&z) ^ (y&z);
    }
};

uint32_t word(int a, int b, int c, int d) {
    a &= 0xff;
    b &= 0xff;
    c &= 0xff;
    d &= 0xff;
    int val =  a << 24 | b << 16 | c << 8 | d;
    return val;
}
}

// hash a 512-bit block of input.
//
void sha1::hash_block(std::vector<uint32_t> const &block) {
    assert(block.size() == block_words);

    int t;
    std::copy(block.begin(), block.end(), W.begin());
    for (t=16; t<80; t++) {
        W[t] = ROTL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
    }

    a = H[0]; b = H[1]; c = H[2]; d = H[3]; e = H[4];

    for (t=0; t<80; t++) {
        T = ROTL(a, 5) + (*fs[t])(b, c, d) + e + K[t] + W[t];
        e = d;
        d = c;
        c = ROTL(b, 30);
        b = a;
        a = T;
    }
    H[0] += a; H[1] += b; H[2] += c; H[3] += d; H[4] += e;
}

// Pad the input to a multiple of 512 bits, and put the length
// in binary at the end. 
std::string sha1::pad(std::string const &input, size_t size) {
    size_t length = size * 8 + 1;
    size_t remainder = length % block_bits;
    size_t pad_len = block_bits-remainder;

    if (pad_len < min_pad)
        pad_len += block_bits;
    ++pad_len;

    pad_len &= ~7;
    std::string padding(pad_len/8, '\0');

    for (size_t i=0; i<sizeof(padding.size()); i++)
        padding[padding.size()-i-1] = (length-1) >> (i*8) & 0xff;
    padding[0] |= (unsigned char)0x80;

    std::string ret(input+padding);
    return ret;
}

// Turn 64 bytes into a block of 16 uint32_t's.
std::vector<uint32_t> sha1::make_block(std::string const &in) { 
    assert(in.size() >= block_bytes);

    std::vector<uint32_t> ret(block_words);

    for (size_t i=0; i<block_words; i++) {
        size_t s = i*4;
        ret[i] = word(in[s], in[s+1], in[s+2], in[s+3]);
    }
    return ret;
}


// Construct a SHA-1 object. More expensive that typical 
// ctor, but not expected to be copied a lot or anything
// like that, so it should be fairly harmless.
sha1::sha1() : K(80), H(5), W(80), fs(80), total_size(0) {
    static const uint32_t H0[] = { 
        0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
    };
    static const uint32_t Ks[] = {
        0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6
    };

    std::copy(H0, H0+hash_size, H.begin());

    std::fill_n(K.begin()+00, 20, Ks[0]);
    std::fill_n(K.begin()+20, 20, Ks[1]);
    std::fill_n(K.begin()+40, 20, Ks[2]);
    std::fill_n(K.begin()+60, 20, Ks[3]);

    static f1 sf1;
    static f2 sf2;
    static f3 sf3;

    std::fill_n(fs.begin()+00, 20, &sf1);
    std::fill_n(fs.begin()+20, 20, &sf2);
    std::fill_n(fs.begin()+40, 20, &sf3);
    std::fill_n(fs.begin()+60, 20, &sf2);
}

// The two ways to provide input for hashing: as a stream or a string.
// Either way, you get the result as a vector<uint32_t>. It's a fairly
// small vector, so even if your compiler doesn't do return-value 
// optimization, the time taken for copying it isn't like to be 
// significant.
// 
std::vector<uint32_t> sha1::operator()(std::string const &input) { 
    std::string temp(pad(input, total_size + input.size()));
    std::vector<uint32_t> block(block_size);

    size_t num = temp.size()/block_bytes;

    for (unsigned block_num=0; block_num<num; block_num++) {
        size_t s;
        for (size_t i=0; i<block_size; i++) {
            s = block_num*block_bytes+i*4;
            block[i] = word(temp[s], temp[s+1], temp[s+2], temp[s+3]);
        }
        hash_block(block);  
    }
    return H;
}

std::vector<uint32_t> sha1::operator()(std::istream &in) { 
    char raw_block[65];

    while (in.read(raw_block, block_bytes)) {
        total_size += block_bytes;
        std::string b(raw_block, in.gcount());
        hash_block(make_block(b));
    }
    std::string x(raw_block, in.gcount());
    return operator()(x);
}

std::ostream &operator<<(std::ostream &os, sha1 const &s) { 
    // Display a SHA-1 result in hex.
    for (size_t i=0; i<(s.H).size(); i++)
        os << std::fixed << std::setprecision(8) << std::hex << std::setfill('0') << (s.H)[i] << " ";
    return os << std::dec << std::setfill(' ') << "\n";
}

}

#ifdef TEST
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>

// A minimal test harness to check that it's working correctly. Strictly black-box
// testing, with no attempt at things like coverage analysis. Nonetheless, I believe
// it should cover most of the code -- the core hashing code all gets used for every
// possible value. The padding code should be tested fairly thoroughly as well -- the
// first test is a fairly simple case, and the second the more complex one (where the 
// padding requires adding another block).
class tester {
    bool verify(uint32_t *test_val, std::vector<uint32_t> const &hash, std::ostream &os) {
        // Verify that a result matches a test value and report result.
        for (size_t i=0; i<hash.size(); i++)
            if (hash[i] != test_val[i]) {
                os << "Mismatch. Expected: " << test_val[i] << ", but found: " << hash[i] << "\n";
                return false;
            }
            os << "Message digest Verified.\n\n";
            return true;
    }

public:

    bool operator()(uint32_t *test_val, std::string const &input) {
        std::cout << "Testing hashing from string:\n\"" << input << "\"\n";
        crypto::sha1 hasher1;
        std::vector<uint32_t> hash = hasher1(input);
        std::cout << "Message digest is:\n\t" << hasher1;
        bool verified = verify(test_val, hash, std::cerr);

        crypto::sha1 hasher2;
        std::cout << "Testing hashing from Stream:\n";
        std::istringstream buf(input);
        hash = hasher2(buf);
        std::cout << "Message digest is:\n\t" << hasher2;

        return verified & verify(test_val, hash, std::cerr);
    }
};

int main() {
    // These test values and results come directly from the FIPS pub.
    //
    char const *input1 = "abc";
    char const *input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
    uint32_t result1[] = {0xA9993E36, 0x4706816A, 0xBA3E2571, 0x7850C26C, 0x9CD0D89D};
    uint32_t result2[] = {0x84983E44, 0x1C3BD26E, 0xBAAE4AA1, 0xF95129E5, 0xE54670F1 };
    bool correct = tester()(result1, input1);
    correct &= tester()(result2, input2);
    if (correct)
        std::cerr << "All Tests passed!\n";
    else
        std::cerr << "Test Failed!\n";
}
#elif defined(MAIN) 

#include <sstream>
#include <fstream>
#include <iostream>

int main(int argc, char **argv) { 
    if (argc < 2) {
        std::cerr << "Usage: sha1 [filename]\n";
        return EXIT_FAILURE;
    }
    for (int i=1; i<argc; i++) {
        crypto::sha1 hash;
        std::ifstream in(argv[i], std::ios_base::binary);
        if (in.good()) {
            hash(in);
            std::cout << "SHA-1(" << argv[i] << ") = " << hash << "\n";
        }
    }
    return 0;
}
#endif

Answer 3

使用DigestInputStream或DigestOutputStream类，您可以透明且几乎毫不费力地完成此操作。或者您可以手动使用MessageDigest，这几乎一样容易。

Answer 4

是的，它可以用于哈希流，因为它是迭代的：你每次迭代都会得到512位，你获得了一个新的512位块，可以用于下一个。

在这里你可以找到伪代码：link。在Java中实现它应该很容易。当你遇到最后一个块和一些按位操作时，你只需要做一些填充！

警告：唯一的事情是，通常需要 unsigned ints ，而Java只提供签名，你应该做一些技巧来避免问题..

Answer 5

是的。您只需要一次读取512位（64字节）的块来计算SHA-1哈希值。

您需要跟踪流的长度并在最后一个或两个块中预先形成正确的填充但是，这是完全可行的。

我以前用C ++编写过这样的实现，但我担心我不能自由发布它。

Answer 6

SHA-1以块为单位处理数据，因此您可以在流中处理文件。我非常有信心，充气城堡加密库在足够低的水平上实现SHA-1，您可以流式传输数据。我使用其他块密码与充气城堡加密库非常相似。

http://www.bouncycastle.org/

可以在流上计算SHA-1算法吗？内存占用少？

6 个答案: