Efficient data/string reading and copying from a file (CSV) in C

Question

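I have read the other posts about copying data from files. Let me explain why my case is different. Using C, I have to read 43 million input lines from a csv file. The entries contain no errors and have the following form: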

int, int , int , variable length string with only alphanumerics and spaces , int \n

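I copy all the data into arrays and structures in memory so that I can compute some very, very simple averages over it, and then I output all the processed data to a file. Nothing fancy. There are three main areas where I need help: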

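Regarding the strings (my BIGGEST problem is here): each string is first read from the file, then copied into an array, and then passed to another function, which copies it into dynamic memory only if a condition is met. For example: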
```
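void fileAnalizer(void) {
  while ( ! EOF ) {
    /* pass 1: the string is read from the file into a temporary array */
    char * s = function_to_read_string_from_file();
    data_to_structures(s);
  }
  ....
  ....
  processData(arrays);
  dataToFiles(arrays);
}

void data_to_structures(char * s) {
  if ( condition is met ) {
    /* pass 2: strlen() walks the string; pass 3: strcpy() walks it again */
    char * aux = malloc((strlen(s) + 1) * sizeof(char));
    strcpy(aux, s);
  }
  ....
  ...
}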
```
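As you can see, the string is traversed three times. I need a way to do this more efficiently, going over the string fewer times. I have tried reading char by char while counting the string length, but that makes the whole process slower.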

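Reading the input efficiently: would you recommend copying all the data into a buffer first? If so, what kind of buffer, and in many chunks or just one? This is my current reading routine: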

void
csvReader(FILE* f){
    T_structCDT c;
    c.string = malloc(MAX_STRING_LENGHT*sizeof(char));
    while (fscanf(f,"%d,%d,%d,%[^,],%d\n",&c.a, &c.b, &c.vivienda, c.c, &c.d)==5 ){
        data_to_structures(c);
    }
}

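Then I dump close to 500 csv lines of processed data into other files. How would you recommend dumping them? Line by line, or by sending the data to a buffer again and then dumping it? My code currently looks like this: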

void dataToFiles(arrayOfStructs immenseAr1, arrayOfStructs immenseAr2){
  for (iteration over immenseAr1) {
      fprintf(f1, "%d,%d,%s\n", immenseAr1[i].a, immenseAr1[i].b, immenseAr1[i].string);
  }
  for (iteration over immenseAr2) {
      fprintf(f2, "%d,%d,%s\n", immenseAr2[i].a, immenseAr2[i].b, immenseAr2[i].string);
  }
}

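I have to read all the data before dumping. Would you recommend a different approach from storing all the data in memory, then analyzing it, then dumping all the analyzed data? With 20 million lines, the program currently takes more than 40 seconds. I really need to bring that time down.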

Answer 1

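Use aux = strdup(s); in place of malloc(), strlen() & strcpy().
Your OS (file system) is generally very efficient at buffering data streams. One might find a more efficient way to buffer the stream, but such attempts usually end up redundantly buffering what the OS has already buffered. Your OS most likely provides specific functions that let you bypass the buffering normally done by the OS/file system. Generally, this means not using "stdio.h" functions such as fscanf(), etc.
Again, be careful not to double-buffer your data needlessly. Keep in mind that the OS will buffer your data and write it out when it is generally most efficient to do so. (This is why there is an fflush() function: it lets you ask for buffered output to be written out before your program continues.) And, just as there are usually OS-specific functions to bypass the OS read buffer, there are usually OS-specific functions to bypass the OS write buffer. However, these functions are probably beyond the scope of your needs (and perhaps of this audience).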

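My summary answer (as stated above) is that trying to outsmart the OS and the way it buffers data streams generally results in less efficient code.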
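
For the dump step, relying on the existing buffering could look like the sketch below: each record is written with fprintf() and the stream is flushed once at the end, with no hand-rolled staging buffer. The struct record layout and the dump_records name are assumptions made for illustration; only the "%d,%d,%s" line format comes from the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record layout, mirroring the "%d,%d,%s" lines in the question. */
struct record {
    int a, b;
    const char *string;
};

/* Writes n records line by line; stdio batches the small writes internally.
 * Returns 0 on success, -1 if any write or the final flush fails. */
int dump_records(FILE *out, const struct record *r, size_t n) {
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "%d,%d,%s\n", r[i].a, r[i].b, r[i].string) < 0)
            return -1;
    }
    return fflush(out) == 0 ? 0 : -1;   /* a single flush at the end */
}
```

With only about 500 output lines, this step is unlikely to matter next to reading 43 million input lines.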

Answer 2

Try scanning the large file without storing all of it in memory, keeping just one record at a time in a local variable:

void csvReader(FILE *f) {
    T_structCDT c;
    int count = 0;
    c.string = malloc(1000);
    while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.vivienda, c.c, &c.d) == 5) {
        // nothing for now
        count++;
    }
    printf("%d records parsed\n", count);
    free(c.string);
}

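Measure the time this simple parser takes:

If it is fast enough, perform the selection test and output the few matching records one at a time as they are found during the parse phase. The extra time for these steps should be fairly small, since only a few records match.
If it takes too long, you need a fancier CSV parser. That is a lot of work, but it can be done and made fast, especially if you can assume that every record in your input file uses this simple format. The details are too broad a subject for this answer, but the achievable speed should be close to that of cat csvfile > /dev/null or grep a_short_string_not_present csvfile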
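
The first option above can be sketched as follows: parse one record at a time into a local struct, test it immediately, and keep only running totals for the average. The struct layout, the selection test and the name scan_csv are assumptions for illustration; the format string is the one used in the parser above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical one-record buffer: four ints and a bounded string field. */
struct row {
    int a, b, c, d;
    char s[1000];
};

/* Streams the input once. Matching rows are written to out as they are
 * found; *avg_d receives the mean of column d over all parsed rows
 * (0.0 if there are none). Returns the number of rows parsed. */
long scan_csv(FILE *in, FILE *out, double *avg_d) {
    struct row r;
    long rows = 0;
    double sum_d = 0.0;

    assert(in != NULL && out != NULL && avg_d != NULL);
    while (fscanf(in, "%d,%d,%d,%999[^,],%d\n",
                  &r.a, &r.b, &r.c, r.s, &r.d) == 5) {
        rows++;
        sum_d += r.d;
        if (r.a == 100 && r.b == 100)   /* placeholder selection test */
            fprintf(out, "%d,%d,%s\n", r.a, r.b, r.s);
    }
    *avg_d = rows ? sum_d / rows : 0.0;
    return rows;
}
```

Memory use stays constant regardless of the file size, because nothing but the current record and two accumulators is kept.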

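On my system (an average Linux server with a regular hard disk), parsing 40 million lines totalling 2GB takes less than 2 seconds from a cold start and less than 4 seconds the second time: disk I/O seems to be the bottleneck.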

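If you need to perform this selection often, you should probably use a different data format, possibly a database system. If the scan is performed only occasionally on data whose format is fixed, faster storage such as an SSD will help, but it will not work miracles.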

EDIT: To put words into action, I wrote a simple generator and an extractor.

Here is a simple program to generate CSV data:

#include <stdio.h>
#include <stdlib.h>

const char *dict[] = {
    "Lorem", "ipsum", "dolor", "sit", "amet;", "consectetur", "adipiscing", "elit;",
    "sed", "do", "eiusmod", "tempor", "incididunt", "ut", "labore", "et",
    "dolore", "magna", "aliqua.", "Ut", "enim", "ad", "minim", "veniam;",
    "quis", "nostrud", "exercitation", "ullamco", "laboris", "nisi", "ut", "aliquip",
    "ex", "ea", "commodo", "consequat.", "Duis", "aute", "irure", "dolor",
    "in", "reprehenderit", "in", "voluptate", "velit", "esse", "cillum", "dolore",
    "eu", "fugiat", "nulla", "pariatur.", "Excepteur", "sint", "occaecat", "cupidatat",
    "non", "proident;", "sunt", "in", "culpa", "qui", "officia", "deserunt",
    "mollit", "anim", "id", "est", "laborum.",
};

int csvgen(const char *fmt, long lines) {
    char buf[1024];

    if (*fmt == '\0')
        return 1;

    while (lines > 0) {
        size_t pos = 0;
        int count = 0;
        for (const char *p = fmt; *p && pos < sizeof(buf); p++) {
            switch (*p) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                count = count * 10 + *p - '0';
                continue;
            case 'd':
                if (!count) count = 101;
                /* signed value in the range -(count - 1) .. count - 1 */
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%d",
                                rand() % (2 * count - 1) - count + 1);
                count = 0;
                continue;
            case 'u':
                if (!count) count = 101;
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%u",
                                (unsigned)(rand() % count));
                count = 0;
                continue;
            case 's':
                if (!count) count = 4;
                count = rand() % count + 1;
                while (count-- > 0 && pos < sizeof(buf)) {
                    pos += snprintf(buf + pos, sizeof(buf) - pos, "%s ",
                                    dict[rand() % (sizeof(dict) / sizeof(*dict))]);
                }
                if (pos < sizeof(buf)) {
                    pos--;
                }
                count = 0;
                continue;
            default:
                buf[pos++] = *p;
                count = 0;
                continue;
            }
        }
        if (pos < sizeof(buf)) {
            buf[pos++] = '\n';
            fwrite(buf, 1, pos, stdout);
            lines--;
        }
    }
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "usage: csvgen format number\n");
        return 2;
    }
    return csvgen(argv[1], strtol(argv[2], NULL, 0));
}

Here is the extractor, with 3 different parsing methods:

#include <stdio.h>
#include <stdlib.h>

static inline unsigned int getuint(const char *p, const char **pp) {
    unsigned int d, n = 0;
    while ((d = *p - '0') <= 9) {
        n = n * 10 + d;
        p++;
    }
    *pp = p;
    return n;
}

int csvgrep(FILE *f, int method) {
    struct {
        int a, b, c, d;
        int spos, slen;
        char s[1000];
    } c;
    int count = 0, line = 0;

    // select 500 out of 43M
#define select(c)  ((c).a == 100 && (c).b == 100 && (c).c > 74 && (c).d > 50)

    if (method == 0) {
        // default method: fscanf
        while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.c, c.s, &c.d) == 5) {
            line++;
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 1) {
        // use fgets and simple parser
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            char *p = buf;
            int i;
            line++;
            c.a = strtol(p, &p, 10);
            p += (*p == ',');
            c.b = strtol(p, &p, 10);
            p += (*p == ',');
            c.c = strtol(p, &p, 10);
            p += (*p == ',');
            for (i = 0; *p && *p != ','; p++) {
                c.s[i++] = *p;
            }
            c.s[i] = '\0';
            p += (*p == ',');
            c.d = strtol(p, &p, 10);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 2) {
        // use fgets and hand coded parser, positive numbers only, no string copy
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            const char *p = buf;
            line++;
            c.a = getuint(p, &p);
            p += (*p == ',');
            c.b = getuint(p, &p);
            p += (*p == ',');
            c.c = getuint(p, &p);
            p += (*p == ',');
            c.spos = p - buf;
            while (*p && *p != ',') p++;
            c.slen = p - buf - c.spos;
            p += (*p == ',');
            c.d = getuint(p, &p);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%.*s,%d\n", c.a, c.b, c.c, c.slen, buf + c.spos, c.d);
            }
        }
    } else {
        fprintf(stderr, "csvgrep: unknown method: %d\n", method);
        return 1;
    }
    fprintf(stderr, "csvgrep: %d records selected from %d lines\n", count, line);
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc > 2 && strtol(argv[2], NULL, 0)) {
        // non zero second argument -> set a 1M I/O buffer
        setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
    }
    return csvgrep(stdin, argc > 1 ? strtol(argv[1], NULL, 0) : 0);
}

Here are some comparative benchmark figures:

$ time ./csvgen "u,u,u,s,u" 43000000 > 43m
real    0m34.428s    user    0m32.911s    sys     0m1.358s
$ time grep zz 43m
real    0m10.338s    user    0m10.069s    sys     0m0.211s
$ time wc -lc 43m
 43000000 1195458701 43m
real    0m1.043s     user    0m0.839s     sys     0m0.196s
$ time cat 43m > /dev/null
real    0m0.201s     user    0m0.004s     sys     0m0.195s
$ time ./csvgrep 0 < 43m > x0
csvgrep: 508 records selected from 43000000 lines
real    0m14.271s    user    0m13.856s    sys     0m0.341s
$ time ./csvgrep 1 < 43m > x1
csvgrep: 508 records selected from 43000000 lines
real    0m8.235s     user    0m7.856s     sys     0m0.331s
$ time ./csvgrep 2 < 43m > x2
csvgrep: 508 records selected from 43000000 lines
real    0m3.892s     user    0m3.555s     sys     0m0.312s
$ time ./csvgrep 2 1 < 43m > x3
csvgrep: 508 records selected from 43000000 lines
real    0m3.706s     user    0m3.488s     sys     0m0.203s
$ cmp x0 x1
$ cmp x0 x2
$ cmp x0 x3

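As you can see, specializing the parsing method (fgets plus a simple parser instead of fscanf) gains roughly 40% (14.3 s down to 8.2 s), and hand-coding the integer conversion and string scanning gains a further 50% (8.2 s down to 3.9 s). Using a 1 megabyte buffer instead of the default size yields only a marginal gain of 0.2 seconds.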

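To improve the speed further, you can use mmap() to bypass the I/O streaming interface and make stronger assumptions about the file contents. The code above still handles invalid formats gracefully; you could remove some of those tests and shave an extra 5% off the execution time at the cost of reliability.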
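
A minimal sketch of the mmap() idea, assuming a POSIX system: map the whole file read-only and walk the bytes directly instead of going through FILE streams. The name count_lines_mmap is hypothetical, and only line counting is shown; the field parsing from method 2 would be applied to each line inside the same loop.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Counts the newline characters in a file by mapping it into memory.
 * Returns -1 on any error. POSIX only. */
long count_lines_mmap(const char *path) {
    struct stat st;
    long lines = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    const char *p = data, *end = data + st.st_size;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;                        /* continue just past the newline */
    }
    munmap(data, (size_t)st.st_size);
    return lines;
}
```

Whether this beats fgets() on a given system has to be measured; the benchmark above suggests that the parsing itself, not the buffering, dominates the run time.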

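The above benchmark was run on a system with an SSD drive, and the file 43m fits in RAM, so the timings do not include much disk I/O latency. grep is surprisingly slow, and increasing the length of the search string makes it even worse... wc -lc sets a target for scanning performance, about a factor of 4 faster, but cat seems out of reach.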

Answer 3

Choosing the right tool

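With this much data (you speak of 43 million input lines from a csv file), hard disk I/O will be the bottleneck. And because you are processing flat text files, every time you need a different calculation (if you change your mind and want slightly different "very, very simple averages" before you "output all the processed data to a file"), you have to go through the whole process again.

A better strategy is to use a database management system. It is the appropriate tool for storing and processing large amounts of data, and it gives you the flexibility to do any processing you like, with indexed data, efficient memory handling, simple SQL commands, and so on.

If you do not want to set up an SQL server (such as MySQL or PostgreSQL), you can use a database management system that needs no server, such as SQLite: http://www.sqlite.org/. You can drive it from the command line with the sqlite3 shell program, from a C program if you prefer (SQLite is in fact a C library), or through a GUI front end.

SQLite lets you create a database, create tables, import CSV files into them, perform calculations, and dump the results in a variety of formats, ...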

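A walkthrough example using SQLite

Here is an example to get you started. It illustrates the use of both the sqlite3 shell program and C code.

Let's assume your data, in the format you describe, is in data1.csv, which contains: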

1,2,3,variable length string with only alphanumerics and spaces,5
11,22,33,other variable length string with only alphanumerics and spaces,55
111,222,333,yet another variable length string with only alphanumerics and spaces,555

and in data2.csv, which contains:

2,3,4,from second batch variable length string with only alphanumerics and spaces,6
12,23,34,from second batch other variable length string with only alphanumerics and spaces,56
112,223,334,from second batch yet another variable length string with only alphanumerics and spaces,556

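Using the sqlite3 command line utility, create a database, create a table with the proper format, import the CSV files, and issue SQL commands like so: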

$ sqlite3 bigdatabase.sqlite3
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> create table alldata(col1 int, col2 int, col3 int, col4 varchar(255), col5 int);
sqlite> .mode csv
sqlite> .import data1.csv alldata
sqlite> .import data2.csv alldata
sqlite> select * from alldata;
1,2,3,"variable length string with only alphanumerics and spaces",5
11,22,33,"other variable length string with only alphanumerics and spaces",55
111,222,333,"yet another variable length string with only alphanumerics and spaces",555
2,3,4,"from second batch variable length string with only alphanumerics and spaces",6
12,23,34,"from second batch other variable length string with only alphanumerics and spaces",56
112,223,334,"from second batch yet another variable length string with only alphanumerics and spaces",556
sqlite> select avg(col2) from alldata;
82.5
sqlite>

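(press Ctrl-D to exit the SQLite shell)

Above, we created a bigdatabase.sqlite3 file that holds the database managed by SQLite, with one table, alldata. We imported the CSV data into that table, displayed its contents (do not do this on 43 million lines), and computed and displayed the average of the integers in the column we named col2, which happens to be the second column.

You can use the SQLite database you just created from C, through the SQLite library, to achieve the same thing.

Create a sqlite-average.c file (adapted from the example at http://sqlitestudio.pl/) as follows: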

#include <stdio.h>
#include <sqlite3.h>

static int callback(void *NotUsed, int argc, char **argv, char **azColName) {
    int i;
    for(i=0; i<argc; i++){
        printf("%s = %s\n", azColName[i], argv[i] ? argv[i] : "NULL");
    }
    printf("\n");
    return 0;
}

int main(void) {
    sqlite3 *db;
    char *zErrMsg = 0;
    int rc;

    rc = sqlite3_open("bigdatabase.sqlite3", &db);
    if (rc) {
        fprintf(stderr, "Can't open database: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }
    rc = sqlite3_exec(db, "select avg(col2) from alldata;", callback, 0, &zErrMsg);
    if (rc!=SQLITE_OK){
        fprintf(stderr, "SQL error: %s\n", zErrMsg);
        sqlite3_free(zErrMsg);
    }
    sqlite3_close(db);

    return 0;
}

Compile it with gcc, linking against the installed SQLite library, like this:

$ gcc -Wall sqlite-average.c -lsqlite3

Run the compiled executable:

$ ./a.out
avg(col2) = 82.5

$

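You will probably want to create indexes on the columns you look data up by, for example columns 2 and 5 in this table, so that the information can be fetched faster: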

sqlite> create index alldata_idx ON alldata(col2,col5);

Decide (if applicable) which column will hold the table's primary key, and so on.

For more information, see the SQLite documentation at http://www.sqlite.org/.
