Question

So, I have two financial data files, say 'symbols' and 'volumes'. In symbols I have strings, such as:

FOO
BAR
BAZINGA
...

In volumes, I have integer values, for example:

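The idea is that stock symbols repeat in the file, and I need to find the total volume traded for each stock. So for every line on which I see FOO, I add the value observed in volumes to FOO's running total. The problem is that these files can be large: easily 5 to 100 million records. A typical day may have ~1K distinct symbols in the file.

Running strcmp against the known symbols for every new line would be very inefficient. I am thinking of using an associative array: a hash table library that allows string keys, for example uthash or Glib's hash table.

I have been reading some very good things about Judy arrays. Is licensing an issue in this case?

Any thoughts on choosing an efficient hash table implementation? Also, should I use a hash table at all, or something else entirely?

Um.. apologies for the earlier omission: I need a pure C solution.

Thanks.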

Answer 1

A hash table definitely sounds good. You should look at the libiberty implementation. You can find it in the GCC project here.
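As a rough illustration of the hash-table approach, here is a minimal hand-rolled sketch in pure C. It does not use libiberty's API; the names `add_volume`, `get_volume`, and `NBUCKETS` are made up for this example, and the bucket count is an assumption based on the ~1K distinct symbols mentioned in the question.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096  /* Assumed size: a power of two well above ~1K symbols. */

struct entry {
    char *symbol;          /* Owned copy of the key. */
    unsigned long volume;  /* Running total for this symbol. */
    struct entry *next;    /* Next entry in the same bucket. */
};

static struct entry *buckets[NBUCKETS];

/* 32-bit FNV-1a string hash. */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char) *s++;
        h *= 16777619u;
    }
    return h;
}

/* Add volume to the total for symbol.
   Returns 0 on success, -1 on allocation failure. */
int add_volume(const char *symbol, unsigned long volume)
{
    struct entry **slot = &buckets[hash_str(symbol) & (NBUCKETS - 1)];
    struct entry *e;

    /* Walk the chain; strcmp runs only on symbols sharing this bucket. */
    for (e = *slot; e != NULL; e = e->next) {
        if (strcmp(e->symbol, symbol) == 0) {
            e->volume += volume;
            return 0;
        }
    }

    /* First time this symbol is seen: prepend a new entry. */
    e = malloc(sizeof *e);
    if (e == NULL) return -1;
    e->symbol = malloc(strlen(symbol) + 1);
    if (e->symbol == NULL) { free(e); return -1; }
    strcpy(e->symbol, symbol);
    e->volume = volume;
    e->next = *slot;
    *slot = e;
    return 0;
}

/* Return the total for symbol, or 0 if it was never seen. */
unsigned long get_volume(const char *symbol)
{
    const struct entry *e;

    for (e = buckets[hash_str(symbol) & (NBUCKETS - 1)]; e != NULL; e = e->next)
        if (strcmp(e->symbol, symbol) == 0)
            return e->volume;
    return 0;
}
```

With around a thousand symbols spread over 4096 buckets, most chains hold at most one entry, so each record costs roughly one hash computation and one strcmp.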

Answer 2

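I would use the Map from the C++ STL. Here is what the pseudocode looks like:

map<string, long int> Mp;
while (eof is not reached) {
    string stock_name = readline_from_file1();
    long int stock_value = readline_from_file2();
    Mp[stock_name] += stock_value;
}
for (each stock_name in Mp)
    cout << stock_name << " " << Mp[stock_name] << endl;

Depending on the amount of data you have, this may be a bit inefficient, but I suggest it because it is easier to implement.

If the solution must be implemented strictly in C, then hashing would be the best solution. But if you think implementing a hash table and writing the code to avoid collisions is complex, I have another idea that uses a trie. This may sound odd, but it can also help a bit.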

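I suggest you read this one. It has a good explanation of what a trie is and how to build it. An implementation in C is also given there. You may then wonder where to store the volumes for each stock. The value can be stored at the end of the stock string and can easily be updated when needed.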

But since you say you are new to C, I suggest you try implementing it with hashing first, and then try this.
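The idea of keeping a running total at the node where each symbol ends can be sketched as follows. This is only an illustration, not the implementation from the linked article; `trie_node`, `trie_add`, and `trie_get` are hypothetical names, and the sketch assumes symbols consist only of the uppercase letters A to Z.

```c
#include <stdlib.h>

#define ALPHABET 26  /* Assumption: symbols use only the letters A-Z. */

struct trie_node {
    struct trie_node *child[ALPHABET];
    unsigned long volume;  /* Total for the symbol ending at this node. */
    int is_symbol;         /* Nonzero if a symbol ends here. */
};

/* Allocate a zeroed node (no children, zero volume). */
struct trie_node *trie_new_node(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Walk (and extend) the path for symbol, then add volume at its last node.
   Returns 0 on success, -1 on an unsupported character or allocation failure. */
int trie_add(struct trie_node *root, const char *symbol, unsigned long volume)
{
    struct trie_node *node = root;

    for (; *symbol != '\0'; symbol++) {
        int idx = *symbol - 'A';
        if (idx < 0 || idx >= ALPHABET) return -1;
        if (node->child[idx] == NULL) {
            node->child[idx] = trie_new_node();
            if (node->child[idx] == NULL) return -1;
        }
        node = node->child[idx];
    }
    node->is_symbol = 1;
    node->volume += volume;
    return 0;
}

/* Return the total for symbol, or 0 if it was never added. */
unsigned long trie_get(const struct trie_node *root, const char *symbol)
{
    const struct trie_node *node = root;

    for (; *symbol != '\0' && node != NULL; symbol++) {
        int idx = *symbol - 'A';
        if (idx < 0 || idx >= ALPHABET) return 0;
        node = node->child[idx];
    }
    return (node != NULL && node->is_symbol) ? node->volume : 0;
}
```

Each update costs one pointer step per character of the symbol and involves no string comparisons, at the price of 26 child pointers per node.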

Answer 3

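I wonder why you would not stick with your associative array idea. I assume that, at the end of execution, you need a list of unique names with their aggregated values. As long as you have memory for all the unique names, the following will work. Of course, it may not be very efficient, but depending on the pattern of your data, a few tricks are possible.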

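#include <stdio.h>
#include <string.h>

#define MAX_NAMES 4096 /* This could be worse if all names are unique. */
#define NAME_LEN  16

struct Customer {
    char name[NAME_LEN];
    unsigned long value;
};

static struct Customer Customers[MAX_NAMES];
static size_t Consolidate_Index = 0;

/* Add value to the entry for name, creating the entry on first sight.
   Returns 0 on success, -1 if the table is full. */
static int consolidate_names(const char *name, unsigned long value)
{
    size_t i;

    /* Linear search through the names seen so far. */
    for (i = 0; i < Consolidate_Index; i++) {
        if (strcmp(Customers[i].name, name) == 0) {
            Customers[i].value += value;
            return 0;
        }
    }

    if (Consolidate_Index == MAX_NAMES) return -1;

    /* Name not seen before: append a new entry. */
    strncpy(Customers[Consolidate_Index].name, name, NAME_LEN - 1);
    Customers[Consolidate_Index].name[NAME_LEN - 1] = '\0';
    Customers[Consolidate_Index].value = value;
    Consolidate_Index++;
    return 0;
}

int main(int argc, char **argv)
{
    FILE *fnames, *fvalues;
    char name[NAME_LEN];
    unsigned long value;
    size_t i;

    /* Expect the names file and the values file as arguments. */
    if (argc < 3) return 1;
    fnames = fopen(argv[1], "r");
    fvalues = fopen(argv[2], "r");
    if (fnames == NULL || fvalues == NULL) return 1;

    /* Read one name and one value per iteration until either file is done. */
    while (fscanf(fnames, "%15s", name) == 1 &&
           fscanf(fvalues, "%lu", &value) == 1) {
        if (consolidate_names(name, value) != 0) return 2;
    }

    for (i = 0; i < Consolidate_Index; i++)
        printf("%s %lu\n", Customers[i].name, Customers[i].value);

    fclose(fnames);
    fclose(fvalues);
    return 0;
}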

Answer 4

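My solution:

I ended up using a JudySL array to solve this problem. After some reading, implementing the solution with Judy turned out to be quite simple. I am reproducing the solution in full here so that it is useful to others.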

#include <stdio.h>
#include <stdint.h>
#include <Judy.h>

#define BUFSIZE 10 /* A symbol is only 8 chars wide. */

int main (int argc, char const **argv) {

  /* Expect three arguments: symbols file, volumes file, output file. */
  if (argc < 4) return 1;

  FILE *fsymb = fopen(argv[1], "r");
  if (fsymb == NULL) return 1;

  FILE *fvol = fopen(argv[2], "r");
  if (fvol == NULL) return 1;

  FILE *fout = fopen(argv[3], "w");
  if (fout == NULL) return 1;

  unsigned int lnumber = 0;
  uint8_t symbol[BUFSIZE];
  unsigned long volume;

  /* Initialize the associative map as a JudySL array. */
  Pvoid_t assmap = (Pvoid_t) NULL;
  Word_t *value;

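  while (1) {

    /* Stop at the end of either file or on a malformed record.
       The width 9 keeps the read inside the 10-byte symbol buffer. */
    if (fscanf(fsymb, "%9s", (char *) symbol) != 1) break;

    if (fscanf(fvol, "%lu", &volume) != 1) break;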

    ++lnumber;

    /* Insert a new symbol or return value if exists. */
    JSLI(value, assmap, symbol);
    if (value == PJERR) return 2;
    *value += volume;

  }

  symbol[0] = '\0'; /* Start from the empty string. */
  JSLF(value, assmap, symbol); /* Find the next string in the array. */
  while (value != NULL) {
    fprintf(fout, "%s: %lu\n", symbol, *value); /* Print to output file. */
    JSLN(value, assmap, symbol); /* Get next string. */
  }

  Word_t tmp;
  JSLFA(tmp, assmap); /* Free the entire array. */

  fclose(fsymb);
  fclose(fvol);
  fclose(fout);
  return 0;

}

I tested the solution on a 'small' sample containing 300K lines. The output was correct, and the elapsed time was 0.074 seconds.

C: Sums of integer values for string identifiers

4 answers: