Question

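I have raw, unfiltered records in a CSV file (more than 1,000,000 records), and I want to filter those records against a list of files (each file weighs more than 282 MB and holds roughly 2,000,000 records or more). I tried using strstr in C. This is my code: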

while (!feof(rawfh)) //loop to read records from raw file
{   
    j=0; //counter


    while( (c = fgetc(rawfh))!='\n' && !feof(rawfh)) //read a line from raw file
     {
        line[j] = c; line[j+1] = '\0'; j++;
     }
     //function to extract the element in the specified column, in the CSV
     extractcol(line, relcolraw, entry);

     printf("\nWorking on : %s", entry);


     found=0;
     //read a set of 4000 bytes; this is the target file
     while( fgets(buffer, 4000, dncfh)!=NULL && !found )
     {
        if( strstr(buffer, entry) !=NULL) //compare it
          found++;
     }
     rewind(dncfh); //put the file pointer back to the start

   // if the record was not found in the target list, write it into another file
     if(!found)
      { 
         fprintf(out, "%s,\n", entry); printf(" *** written to filtered ***"); 
      }
      else 
      {
        found=0; printf(" *** Found ***");
      }
      //I hope this is the right way to null out a string
      entry[0] = '\0'; line[0] ='\0'; 

      //just to display a # on the screen, to let the user know that the program
      //is still alive and running.
      rawreccntr++;
      if(rawreccntr>=10) 
      {
        printf("#"); rawreccntr=0;
      } 
}

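On average this program takes about 7 to 10 seconds to search for one entry in the target file (282 MB). So, 10 * 1000000 = 10000000 seconds :( God knows how long it will take if I decide to search 25 files.

I would rather write a program than use a spoon-fed solution (grep, sed, etc.). Oh, sorry, but I am using Windows 8 (64-bit, 4 GB RAM, AMD processor Radeon 2 core - 1000Mhz). I used DevC++ (gcc) to compile it.

Please enlighten me with your ideas.

Thanks in advance, and sorry if I sound stupid.

Update by Ali, this is the key information extracted from the comments:

I have a raw CSV file that contains details of customers' phone numbers and addresses. I have the target files in CSV format; they are the Do Not Call lists. I want to write a program that keeps only the phone numbers that are not present in the Do Not Call list. The phone numbers (in both files) are in the second column. However, I do not know any other method. I looked up the Boyer-Moore algorithm but could not implement it in C. Any suggestions on how I should go about searching for the records?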

Answer 1

EDITED

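I suggest you try the ready-made tools found on any Unix/Linux system, grep and awk. You will probably find them just as fast and easier to maintain. I have not seen your data format, but you say the phone numbers are in the second column, so you can get the phone numbers on their own like this (-F, sets the field separator to a comma, as needed for a CSV file):

awk -F, '{print $2}' DontCallFile.csv

If your phone numbers are in double quotes, you can remove those like this:

awk -F, '{print $2}' DontCallFile.csv | tr -d '"'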

Then you can use fgrep with the -f option to check whether the strings listed in one file are present in a second file, like this:

fgrep -f file1.csv file2.csv

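Or you can invert the search and look for strings that are not present in the other file by adding the -v switch to fgrep.

So, your final command might end up looking like this:

fgrep -v -f <(awk -F, '{print $2}' DontCallFile.csv | tr -d '"') file2.csv

That says... print every line of file2.csv that contains none of the strings (the -v option) found in column 2 of the file "DontCallFile.csv". If you want to understand the bit in <(), it is called process substitution, and it basically makes a pseudo-file out of the output of the command inside the parentheses. We need a pseudo-file because fgrep -f expects a file.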

ORIGINAL ANSWER

Why are you still using fgetc()? Surely you would use getline() like this:

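 char *line = NULL;   /* getline() allocates and grows this buffer */
 size_t cap = 0;
 while (getline(&line, &cap, myfile) != -1)   /* POSIX getline(), not ISO C */
 { 
 ...
 }
 free(line);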

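Are you really reading the entire "target" file from the start for every single line in the main file? That will kill you! And why are you doing it in chunks of 4,000 bytes? What happens if one of your strings straddles the 4,000 bytes you are comparing, i.e. the first 8 bytes are in one 4k chunk and the remaining bytes are in the next 4k chunk?

I think you would get better help if you took the time to explain properly what you are trying to do, maybe by doing it with awk or grep (at least figuratively), so we can see what you are actually trying to achieve. For example, your description makes no mention of the "target" file that you use in your code.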
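A hedged illustration of the chunk concern: fgets() stops at a newline, so a record is only split when it is longer than the buffer. The sketch below assumes a hypothetical helper named read_full_line (it is not part of the question's code) that reports whether the whole line fit, so an over-long record is never compared in pieces without the caller knowing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf (capacity cap).
   Returns  1 if a complete line was read (newline stripped),
            0 if the line was longer than the buffer (the rest is skipped),
           -1 at end of file. */
int read_full_line(FILE *fp, char *buf, size_t cap)
{
    assert(cap >= 2);
    if (fgets(buf, (int)cap, fp) == NULL)
        return -1;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';      /* complete line, drop the newline */
        return 1;
    }
    if (feof(fp))
        return 1;                 /* last line without a trailing newline */

    int c = fgetc(fp);
    if (c == EOF || c == '\n')
        return 1;                 /* the line filled the buffer exactly */

    /* The line did not fit: skip the remainder so that the next call
       starts at the beginning of the following line. */
    while ((c = fgetc(fp)) != EOF && c != '\n')
        ;
    return 0;
}
```

A caller can treat a return value of 0 as "record too long" and either enlarge the buffer or log the record, instead of searching a truncated string.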

Answer 2

You can do this with awk, like this:

awk -F, '
     FNR==NR {gsub(/"/,"",$2);dcn[$2]++;next}
     {gsub(/ /,"",$2);if(!dcn[$2])print}
' DontCallFile.csv x.csv

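That says... the field separator is a comma (-F,). Now read the first file (DontCallFile.csv) and process it according to the part in curly braces after FNR==NR, a condition that is true only while the first file is being read. Use gsub (global substitution) to remove the double quotes around the phone number in field 2. Then increment the element of an associative array (i.e. a hash) indexed by the unquoted field 2, and move on to the next record. So basically, once the file "DontCallFile.csv" has been processed, the array dcn[] holds a hash of all the numbers not to call (dcn = dontcallnumbers). Then the code in the second set of curly braces is executed for every line of the second file ("x.csv"). That says... remove all spaces from the phone number in field 2. Then, if that phone number is not in the array dcn[] we built earlier, print the line.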
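The same hash idea can be sketched in C, for readers who want to stay in the question's language. This is a minimal illustration, not a tuned implementation: the names strset, strset_new, strset_add, strset_contains and strset_free are hypothetical, the table uses the djb2 string hash with linear probing, and the capacity is fixed up front on the assumption that the number of Do Not Call records is known (or over-estimated) before loading.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal string set: open addressing, linear probing. */
typedef struct {
    char **slots;   /* NULL marks an empty slot */
    size_t cap;     /* number of slots, always a power of two */
    size_t count;   /* number of stored keys */
} strset;

/* djb2 string hash. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Create a set sized for up to max_keys keys; returns NULL on failure. */
strset *strset_new(size_t max_keys)
{
    strset *set = malloc(sizeof *set);
    if (set == NULL)
        return NULL;
    set->count = 0;
    set->cap = 16;
    while (set->cap < 2 * max_keys)       /* keep the table at most half full */
        set->cap *= 2;
    set->slots = calloc(set->cap, sizeof *set->slots);
    if (set->slots == NULL) {
        free(set);
        return NULL;
    }
    return set;
}

/* Return the slot holding key, or the empty slot where it would be stored. */
static char **strset_slot(const strset *set, const char *key)
{
    size_t i = hash_str(key) & (set->cap - 1);
    while (set->slots[i] != NULL && strcmp(set->slots[i], key) != 0)
        i = (i + 1) & (set->cap - 1);
    return &set->slots[i];
}

/* Store a copy of key; returns 0 on success, -1 if out of memory. */
int strset_add(strset *set, const char *key)
{
    char **slot = strset_slot(set, key);
    if (*slot != NULL)
        return 0;                          /* already present */
    assert(2 * (set->count + 1) <= set->cap);  /* caller exceeded max_keys */
    *slot = malloc(strlen(key) + 1);
    if (*slot == NULL)
        return -1;
    strcpy(*slot, key);
    set->count++;
    return 0;
}

/* Return nonzero if key is in the set. */
int strset_contains(const strset *set, const char *key)
{
    return *strset_slot(set, key) != NULL;
}

void strset_free(strset *set)
{
    for (size_t i = 0; i < set->cap; i++)
        free(set->slots[i]);
    free(set->slots);
    free(set);
}
```

The intended use mirrors the awk script: call strset_add once for field 2 of every Do Not Call record, then call strset_contains for each customer record and print the record when it returns 0.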

Answer 3

Here is one idea for an improvement...

In the code below, what is the point of setting line[j+1] = '\0' on every iteration?

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
{
    line[j] = c; line[j+1] = '\0'; j++;
}

You might as well do that outside the loop:

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
    line[j++] = c;
line[j] = '\0';
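Going one step further, and only as a sketch: comparing the result of fgetc() with EOF (which requires c to be an int) avoids relying on feof() after the fact, and a capacity check keeps a long record from overflowing line. The function name read_line_fgetc and its return convention are assumptions made for this illustration, not part of the question's code.

```c
#include <assert.h>
#include <stdio.h>

/* Read one line into line (capacity cap), without the newline.
   Characters that do not fit are read and dropped.
   Returns the number of characters stored, or -1 if end of file
   was reached before any character was read. */
long read_line_fgetc(FILE *fp, char *line, size_t cap)
{
    int c;                     /* int, not char: it must be able to hold EOF */
    size_t j = 0;              /* characters stored in line */
    size_t seen = 0;           /* characters consumed from the file */

    assert(cap > 0);
    while ((c = fgetc(fp)) != EOF && c != '\n') {
        seen++;
        if (j + 1 < cap)       /* leave room for the terminator */
            line[j++] = (char)c;
    }
    line[j] = '\0';            /* terminate once, after the loop */

    if (c == EOF && seen == 0)
        return -1;
    return (long)j;
}
```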

Answer 4

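My suggestion is as follows.

Put all the Do Not Call phone numbers into an array.
Sort this array.
Use binary search to check whether a given phone number is among the sorted Do Not Call phone numbers.

In the code below, I just hard-coded the numbers. In your application, you will have to replace that with the corresponding code.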

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

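/* qsort-style comparison: a and b point to elements of an array of C strings. */
int compare(const void* a, const void* b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}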

/* Lower-bound binary search over the sorted range [first, last).
   Returns nonzero if val is present, zero otherwise. */
int binary_search(const char** first, const char** last, const char* val) {
  ptrdiff_t len = last - first;
  while (len > 0) {
    ptrdiff_t half = len >> 1;
    const char** middle = first + half;
    if (compare(middle, &val) < 0) {   /* *middle sorts before val: search the right half */
      first = middle + 1;
      len = len - half - 1;
    }
    else
      len = half;
  }
  return first != last && !compare(&val, first);
}

int main(int argc, char** argv) {

  size_t i;

  /* Read _all_ of your don't call phone numbers into an array. */
  /* For the sake of the example, I just hard-coded it. */
  char* dont_call[] = { "908-444-555", "800-200-400", "987-654-321" };

  /* in your program, change length to the number of dont_call numbers actually read. */
  size_t length = sizeof dont_call / sizeof dont_call[0];

  qsort(dont_call, length, sizeof(char *), compare);

  printf("The don\'t call numbers sorted\n");

  for (i=0; i<length; ++i)
    printf("%lu  %s\n", (unsigned long)i, dont_call[i]);

  /* For each phone number, check if it is in the sorted dont_call list. */
  /* Use binary search to check it. */
  char* numbers[] = { "999-000-111", "333-444-555", "987-654-321" };

  size_t n = sizeof numbers / sizeof numbers[0];

  printf("Now checking if we should call a given number\n");

  for (i=0; i<n; ++i) {

    /* binary_search() returns nonzero when the number is on the list. */
    int on_dnc_list = binary_search((const char **)dont_call, (const char **)dont_call+length, numbers[i]);

    const char* as_text = on_dnc_list ? "no" : "yes";

    printf("Should we call %s? %s\n",numbers[i], as_text);
  }

  return 0;
}

This prints:

    The don't call numbers sorted  
    0  800-200-400  
    1  908-444-555  
    2  987-654-321  
    Now checking if we should call a given number  
    Should we call 999-000-111? yes  
    Should we call 333-444-555? yes  
    Should we call 987-654-321? no

The code is definitely not perfect, but it should be enough to get you started.
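As a hedged alternative to a hand-written search, the C standard library already provides bsearch(), which works with the same kind of comparison function as qsort(). The sketch below assumes the numbers are held as an array of C string pointers; the names cmp_str, dnc_sort and dnc_contains are made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison function for arrays of C strings (elements are const char *). */
static int cmp_str(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort the Do Not Call list in place so that it can be binary-searched. */
void dnc_sort(const char **list, size_t n)
{
    qsort(list, n, sizeof list[0], cmp_str);
}

/* Return 1 if number is in the sorted list, 0 otherwise. */
int dnc_contains(const char *const *sorted, size_t n, const char *number)
{
    assert(sorted != NULL);
    /* The key is passed as a pointer to a string pointer, so that it has
       the same shape as the array elements seen by cmp_str. */
    return bsearch(&number, sorted, n, sizeof sorted[0], cmp_str) != NULL;
}
```

Call dnc_sort once after loading the list, then dnc_contains for every customer number.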

Answer 5

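The problem with your algorithm is its complexity. Your approach is O(n*m), where n is the number of customers and m is the number of do_not_call records (or the file size, in your case). You need to reduce this complexity. (And the Boyer-Moore algorithm, which Ali suggested, will not help there. It does not improve the asymptotic complexity, only the constant.) Even binary search, as Ali suggests in his answer, is not the best. It will be O((n+m)*log m). We can do better. The fgrep and awk solutions in Mark Setchell's answers are good ones. (I would pick the one using fgrep, which I guess should perform better, but that is only a guess.) I can offer a similar solution in Perl, which provides more robust CSV parsing and should handle your data sizes easily on decent hardware. The complexity of this kind of solution is O(n+m).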
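To put rough numbers on these bounds, here is a back-of-the-envelope comparison. It assumes the approximate figures from the question, n = 10^6 customer records and m = 2 * 10^6 Do Not Call records in one file, and it ignores constant factors, so it only indicates orders of magnitude.

```latex
\begin{align*}
n \cdot m &= 10^{6} \cdot 2 \cdot 10^{6} = 2 \cdot 10^{12} \\
(n + m) \log_2 m &\approx 3 \cdot 10^{6} \cdot 20.9 \approx 6.3 \cdot 10^{7} \\
n + m &= 3 \cdot 10^{6}
\end{align*}
```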

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHN_COL_DNC => 1;
use constant PHN_COL_CUSTOMERS => 1;

die "Usage: $0 dnc_file [customers]" unless @ARGV>0;
my $dncfile = shift @ARGV;

my $csv = Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1, binary=>1});
my %dnc;

open my $dnc, '<', $dncfile;
while(my $row = $csv->getline($dnc)){
    $dnc{$row->[PHN_COL_DNC]} = undef;
}
close $dnc;

while(my $row = $csv->getline(*ARGV)){
    $csv->print(*STDOUT, $row) unless exists $dnc{$row->[PHN_COL_CUSTOMERS]};
}

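If it does not meet your performance expectations, you can go the C route, but I would definitely recommend using some good CSV parsing and hashmap libraries. I would try libcsv and khash.h.

Filtering text from a huge .csv file in C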
