Question

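How can I dynamically store and retrieve more than 3,000,000 words without using SQL?

I take a word from a document, then check whether that word already exists.

If it does, I increment its count in the entry for the corresponding document.

If it does not, i.e. it is a new word, I create a new column, increment the count for that document, and put zero in it for all other documents.

For example:

I have 93,000 documents, each containing roughly 5,000 words. Whenever a new word appears, a new column is added. In this way the total came to 960,000 words.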

                Word1  word2  word3  word4  word5  ...  new word  ...  word96000
Document1           2     19     45     16      7  ...         0  ...
Document2           4      6      3     56      3  ...         0  ...
Document3          56     34      1     67      4  ...         0  ...
Document4           7     45      9     45      6  ...         0  ...
Document5          56     43    234     87     46  ...         0  ...
Document6          56      6      2      5     23  ...         0  ...
...
Document1000        5      9      9     89     34  ...         1  ...

The count of an added word is updated dynamically in the corresponding document's entry.

Answer 1

A sparse matrix like this is usually best implemented as a dictionary of dictionaries.

Dictionary<string, Dictionary<string, int>> index;

But the question lacks too much detail for me to give more advice.
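A minimal sketch of the dictionary-of-dictionaries idea follows. It assumes the outer key is the document name and the inner key is the word (the answer gives only the type, not the key order). The helper names AddWord and GetCount are hypothetical, and the code uses C# 9 top-level statements. Only non-zero counts are stored, so a new word never requires writing zeros for the other documents.

```csharp
using System;
using System.Collections.Generic;

// Outer key: document name; inner key: word; value: occurrences in that document.
// (This key order is an assumption; the answer only gives the type.)
var index = new Dictionary<string, Dictionary<string, int>>();

// Hypothetical helper: record one occurrence of `word` in `document`.
void AddWord(string document, string word)
{
    if (!index.TryGetValue(document, out var counts))
    {
        counts = new Dictionary<string, int>();
        index[document] = counts;
    }
    counts.TryGetValue(word, out int current); // current is 0 when the word is new
    counts[word] = current + 1;
}

// Hypothetical helper: a missing entry means a count of zero, so zeros are never stored.
int GetCount(string document, string word) =>
    index.TryGetValue(document, out var counts) && counts.TryGetValue(word, out int n) ? n : 0;

AddWord("Document1", "apple");
AddWord("Document1", "apple");
AddWord("Document2", "pear");

Console.WriteLine(GetCount("Document1", "apple")); // 2
Console.WriteLine(GetCount("Document2", "apple")); // 0
```

With this layout, memory grows with the number of distinct (document, word) pairs rather than with documents times vocabulary size.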

Answer 2

To avoid wasting memory, I suggest the following:

class Document {
   // Indexes of the words that occur in this document.
   public List<int> words = new List<int>();
}
List<Document> documents;

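If you have 1000 documents, create List<Document> documents = new List<Document>(1000); (this only reserves capacity, so each Document still has to be added to the list).
Now, if document1 contains the words word2, word19 and word45, add the indexes of those words to the document: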

documents[0].words.Add(2);
documents[0].words.Add(19);
documents[0].words.Add(45);

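You can modify the code to store the words themselves. To see how often word2 occurs, you can loop through the whole document list and check whether each document contains the word's index (this counts the documents that contain it):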

int count = 0;
foreach (Document d in documents) {
   if (d.words.Contains(2)) {
      count++;
   }
}
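The answer does not say how words get their indexes. As a rough sketch, assuming indexes are handed out in order of first appearance through a vocabulary dictionary, the approach could look like the following. The helper names GetWordIndex, AddDocument and DocumentsContaining are hypothetical, each inner list stands in for Document.words, and the code uses C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: word indexes are assigned in order of first appearance.
var vocabulary = new Dictionary<string, int>();

// Each inner list plays the role of Document.words from the answer.
var documents = new List<List<int>>();

// Hypothetical helper: return the index of a word, assigning the next free one if it is new.
int GetWordIndex(string word)
{
    if (!vocabulary.TryGetValue(word, out int id))
    {
        id = vocabulary.Count;
        vocabulary[word] = id;
    }
    return id;
}

// Hypothetical helper: store a document as the list of its word indexes.
void AddDocument(IEnumerable<string> words) =>
    documents.Add(words.Select(w => GetWordIndex(w)).ToList());

// Hypothetical helper: number of documents that contain the word at least once.
int DocumentsContaining(string word) =>
    vocabulary.TryGetValue(word, out int id) ? documents.Count(d => d.Contains(id)) : 0;

AddDocument(new[] { "red", "green", "red" });
AddDocument(new[] { "green", "blue" });

Console.WriteLine(DocumentsContaining("green")); // 2
Console.WriteLine(DocumentsContaining("red"));   // 1
```

List<int>.Contains is a linear scan, so for documents of about 5,000 words a HashSet<int> per document would likely make the lookup cheaper, at the cost of losing repeated occurrences.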

Answer 3

var nWords = (from Match m in Regex.Matches(File.ReadAllText("big.txt").ToLower(), "[a-z]+")
              group m.Value by m.Value)
             .ToDictionary(gr => gr.Key, gr => gr.Count());

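This gives you a dictionary keyed by word, with each word's count as the value. I'm sure you could save that information as you read each file and then build the final list. Maybe?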
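One way the per-file step and the final merge might fit together is sketched below. It assumes the per-file dictionaries are kept in an outer dictionary keyed by file name, and it uses in-memory strings in place of File.ReadAllText so the example runs on its own. CountWords and the file names are hypothetical, and the code uses C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Same counting idea as the query above, applied to one document's text.
Dictionary<string, int> CountWords(string text) =>
    Regex.Matches(text.ToLower(), "[a-z]+")
         .Cast<Match>()
         .GroupBy(m => m.Value)
         .ToDictionary(g => g.Key, g => g.Count());

// Stand-in for reading files; in practice each value would come from File.ReadAllText(path).
var texts = new Dictionary<string, string>
{
    ["doc1.txt"] = "The cat saw the other cat.",
    ["doc2.txt"] = "A dog saw the cat.",
};

// One word-count dictionary per document.
var perDocument = texts.ToDictionary(kv => kv.Key, kv => CountWords(kv.Value));

// Final list: total count of each word across all documents.
var totals = new Dictionary<string, int>();
foreach (var counts in perDocument.Values)
{
    foreach (var pair in counts)
    {
        totals.TryGetValue(pair.Key, out int soFar); // soFar is 0 for a new word
        totals[pair.Key] = soFar + pair.Value;
    }
}

Console.WriteLine(perDocument["doc1.txt"]["cat"]); // 2
Console.WriteLine(totals["cat"]);                  // 3
Console.WriteLine(totals["the"]);                  // 3
```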

Dynamically storing and retrieving more than 3,000,000 words in C# .NET using collections
