Question

The dataset looks like this:

39861    // number of documents
28102    // number of words of the vocabulary (another file)
3710420  // number of nonzero counts in the bag-of-words
1 118 1  // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5

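We were advised not to store it in a dense matrix (of size 39861 x 28102), because it will not fit in memory*. I assume that every integer needs 24 bytes of storage, so a dense matrix would take about 27 GB (= 39861 * 28102 * 24 bytes). So, which data structure should I use to store the dataset?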

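An array of lists?

If so (every list would have nodes with two data members, index_in_vocabulary and count), just post a positive answer. If I assume that every document has 200 words on average, then the space needed would be: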

no_of_documents * words_per_doc * no_of_datamembers * 24 bytes = 39861 * 200 * 2 * 24 bytes ≈ 0.4 GB

If not, which data structure would you suggest (one that requires less space)?
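For illustration, the array-of-lists layout described above could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not part of the original post: the function name `load_bag_of_words` is hypothetical, and each document is stored as two typed arrays from the standard `array` module (typically 4 bytes per item) rather than as lists of 24-byte integer objects. With the 3710420 nonzero counts given in the header, that would come to roughly 3710420 * 2 * 4 bytes ≈ 30 MB, plus some per-document overhead.

```python
from array import array


def load_bag_of_words(lines):
    """Parse the three header lines and the triplets into one
    (indices, counts) pair of typed arrays per document."""
    it = iter(lines)
    # Header: only the first token of each line is the number;
    # anything after it (such as a trailing comment) is ignored.
    n_docs = int(next(it).split()[0])
    n_vocab = int(next(it).split()[0])
    n_nonzero = int(next(it).split()[0])

    # docs[d] holds the data for document d + 1 (ids in the file are 1-based).
    docs = [(array("I"), array("I")) for _ in range(n_docs)]
    for line in it:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        doc_id, word_id, count = (int(x) for x in fields[:3])
        indices, counts = docs[doc_id - 1]
        indices.append(word_id)
        counts.append(count)
    return n_vocab, n_nonzero, docs


# Tiny made-up input in the same format as the dataset.
sample = [
    "2  // number of documents",
    "5  // number of words of the vocabulary",
    "3  // number of nonzero counts",
    "1 2 1",
    "1 4 3",
    "2 1 2",
]
n_vocab, n_nonzero, docs = load_bag_of_words(sample)
```

Only the nonzero entries are stored, so the memory use grows with the number of nonzero counts and not with the product of documents and vocabulary size.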

After storing the dataset, we will need to find the k nearest neighbors (the k most similar documents), using both brute force and LSH.
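As a rough idea of what the brute-force variant could look like on sparse data, here is a minimal sketch. It assumes cosine similarity as the similarity measure (the post does not name one) and stores each document as a `{index_in_vocabulary: count}` dict; the function names are hypothetical. LSH is not covered here.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: count} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(value * b.get(index, 0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(docs, query_id, k):
    """Return the ids of the k documents most similar to docs[query_id]."""
    query = docs[query_id]
    scored = (
        (cosine_similarity(query, doc), doc_id)
        for doc_id, doc in docs.items()
        if doc_id != query_id
    )
    # nlargest keeps only k candidates in memory at a time.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Made-up toy documents, keyed by document_id.
toy_docs = {
    1: {118: 1, 285: 3},
    2: {118: 1, 285: 2},
    3: {46: 1},
}
neighbors = brute_force_knn(toy_docs, 1, 2)
```

Each query compares against every other document, so the cost per query is linear in the number of documents; that linear cost is what LSH is meant to reduce.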

* My personal laptop has 3.8 GiB of RAM, but I can also use a desktop with ~8 GB of RAM.

Answer 1

Consider using the HDF5 format.

It will significantly reduce the size of the file.

See my answer to a similar question.
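The answer does not say how the data would be laid out inside the HDF5 file. One possible layout, sketched here as an assumption, is to store the three columns of the triplets as separate compressed integer datasets using the third-party `h5py` and `numpy` packages. The function names, dataset names, and dtypes below are hypothetical choices made for this example.

```python
def parse_triplets(lines):
    """Split 'document_id index_in_vocabulary count' lines into three lists.

    `lines` should contain only the triplet lines, that is, the part of
    the file after the three header lines.
    """
    doc_ids, word_ids, counts = [], [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        doc_ids.append(int(fields[0]))
        word_ids.append(int(fields[1]))
        counts.append(int(fields[2]))
    return doc_ids, word_ids, counts


def save_triplets_hdf5(path, doc_ids, word_ids, counts):
    """Write the three columns as gzip-compressed HDF5 datasets."""
    # Imported here so that parse_triplets works without these packages.
    import h5py
    import numpy as np

    with h5py.File(path, "w") as f:
        # 39861 documents and 28102 vocabulary entries both fit in 16 bits;
        # counts get 32 bits because their maximum is not known here.
        f.create_dataset("doc_id", data=np.asarray(doc_ids, dtype=np.uint16),
                         compression="gzip")
        f.create_dataset("word_id", data=np.asarray(word_ids, dtype=np.uint16),
                         compression="gzip")
        f.create_dataset("count", data=np.asarray(counts, dtype=np.uint32),
                         compression="gzip")


columns = parse_triplets(["1 118 1", "...", "2 46 1"])
```

Calling `save_triplets_hdf5("docs.h5", *columns)` would then write the file, with `docs.h5` being a placeholder path. How much smaller the result is than the text file depends on the data and the compression settings.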
