Question

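For my research I need to count the lines in CSV files of 10+ GB. The classic way to do this in MATLAB seems to be textscan() with \n as the delimiter, but it has a huge memory footprint and is very slow. I was advised to write a Perl script and call it with str2double(perl('countlines.pl', path)), which appears to be much faster: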

# countlines.pl
while (<>) {};
print $.,"\n";

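Then I tried to gain some advantage by writing a MEX function that does the same thing in C, but had no luck. More surprisingly, I found it to be about 10 times slower than the Perl script (using the LLVM compiler in Xcode 4.6.3):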

//countlines.c

#include "mex.h"

void countlines(char *filepath, double *numLines)
{
    /* Routine */

    numLines[0] = 0;
    FILE *inputFile = fopen(filepath, "r");
    int ch;

    while (EOF != (ch=getc(inputFile)))
        if ('\n' == ch)
            ++numLines[0];
}

void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[])
{
    /* Gateway function */

    int bufferLength, status;
    char *filepath;                 // Input: File path
    double *numLines;               // Output Number of lines

    bufferLength = (mxGetM(prhs[0]) * mxGetN(prhs[0])) + 1; // Get length of string
    filepath = mxCalloc(bufferLength, sizeof(char)); // Allocate memory for input

    // Copy the string data from prhs[0] into a C string
    status = mxGetString(prhs[0], filepath, bufferLength);
    if (status != 0)
        mexErrMsgIdAndTxt("utils:countlines:insufficientSpace", "Insufficient space, string is truncated.");

    // Create the output matrix and get a pointer to the real data in the output matrix
    plhs[0] = mxCreateDoubleMatrix(1,(mwSize)1,mxREAL);
    numLines = mxGetPr(plhs[0]);

    // Call the C routine
    countlines(filepath, numLines);
}

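So,

Apart from the gateway function, where does the overhead in the MEX function come from?
Is there anything I can do to speed this up? I can use any language, as long as it can be interfaced with MATLAB. It seems the only other approach would be to memory-map chunks of the file and split the workload across several cores.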
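
The memory-mapping idea mentioned above could look roughly like the following. This is only a minimal sketch under several assumptions: it is POSIX-only, it maps the whole file at once rather than in chunks, it runs on a single thread, and the function name `count_lines_mmap` is hypothetical. Whether it beats buffered reads would have to be measured.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: counts '\n' bytes in the file at `path` via mmap.
 * Returns 0 on success and stores the count in *out, or -1 on error. */
int count_lines_mmap(const char *path, unsigned long long *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    unsigned long long count = 0;
    if (st.st_size > 0) {               /* mmap rejects a zero-length mapping */
        size_t len = (size_t)st.st_size;
        char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
            close(fd);
            return -1;
        }

        const char *p = data;
        const char *end = data + len;
        /* memchr jumps from one newline to the next inside the mapping */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++count;
            ++p;
        }
        munmap(data, len);
    }

    close(fd);
    *out = count;
    return 0;
}
```

Splitting the work across cores would mean giving each thread its own byte range of the mapping and summing the per-thread counts; that part is not shown here.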

Answer 1

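Apart from the gateway function, where does the overhead in the MEX function come from?

The MEX function is allocating memory.
The function is converting memory into a string.
The function is creating a matrix of doubles.

It cannot be compared with Perl's simple line-counting function, because the two are not functionally identical.

Is there anything I can do to speed this up? Yes: count only the lines, with nothing extra such as reading a matrix of doubles.

Here is an example of counting the lines in a text file using C++: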

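#include <fstream>
#include <string>

std::ifstream text_file(/*...*/);
std::string   text_from_file;
unsigned int  line_count = 0;
// Read one line at a time into text_from_file; the loop ends at EOF
while (std::getline(text_file, text_from_file, '\n'))
{
  ++line_count;
}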

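When comparing performance, the functionality must be the same.

Edit 1:
Make up your mind. Are you counting lines?

Are you counting the rows in a matrix?

Do you want to count only the lines in a file?

If you want to count the rows in a matrix, you will need to modify the Perl script.

If you want the MEX function to count only lines, remove everything except the call to the countlines function.

Why use a double to count lines? Are you expecting a fractional number of lines?

Do you want to use C I/O or C++ I/O?

Reading the data in chunks will speed up the C I/O functions: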

#define MAX_CHUNK_SIZE (1024*1024)
static char buffer[MAX_CHUNK_SIZE];
size_t chars_read = 0;
unsigned int line_count = 0;
//...
/* fread returns 0 at end of file or on error, which ends the loop */
while ((chars_read = fread(buffer, 1, MAX_CHUNK_SIZE, inputFile)) > 0)
{
  for (size_t i = 0; i < chars_read; ++i)
  {
     if (buffer[i] == '\n')
     {
       ++line_count;
     }
  }
}

The bottleneck in accessing a file is the overhead of locating the data. Reading in large chunks reduces that overhead.
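
For reference, the chunked approach can be fleshed out into a complete function. The following is a sketch rather than a tuned implementation: the name `count_lines_chunked`, the 1 MiB chunk size, and the use of `memchr` to scan each chunk are assumptions made for illustration, not part of the original answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (1024 * 1024)  /* 1 MiB per read; an assumed, tunable value */

/* Hypothetical helper: counts '\n' bytes in the file at `path`.
 * Returns 0 on success and stores the count in *out, or -1 on error. */
int count_lines_chunked(const char *path, unsigned long long *out)
{
    FILE *fp = fopen(path, "rb");       /* binary mode: no newline translation */
    if (fp == NULL)
        return -1;

    char *buffer = malloc(CHUNK_SIZE);  /* heap, to avoid a 1 MiB stack array */
    if (buffer == NULL) {
        fclose(fp);
        return -1;
    }

    unsigned long long count = 0;
    size_t n;
    while ((n = fread(buffer, 1, CHUNK_SIZE, fp)) > 0) {
        const char *p = buffer;
        const char *end = buffer + n;
        /* memchr jumps to the next newline within this chunk */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++count;
            ++p;
        }
    }

    int failed = ferror(fp);            /* distinguish a read error from EOF */
    free(buffer);
    fclose(fp);
    if (failed)
        return -1;

    *out = count;
    return 0;
}
```

The counter is an integer type, so nothing is converted to or from double inside the loop. A final line that lacks a trailing newline is not counted, which matches the behavior of the getc loop in the question.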

Answer 2

Have you read the Perl FAQ entry on this topic, which gives about 6 examples?

perldoc -q 'How do I count the number of lines in a file'

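The wc command has been ported to Windows, so if you are willing to install it, that is probably the best solution. Otherwise, I would use the Perl example that comes just before the wc example (fixed and optimized below).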

    my $lines = 0;
    my $buffer;
    open my $fh, '<:raw', $filename
        or die "Can't open $filename: $!";
    # Read 64 KiB at a time; tr returns the number of newlines in the buffer
    while( sysread $fh, $buffer, 64*1024 ) {
        $lines += ( $buffer =~ tr|\n|| );
    }
    close $fh;

Answer 3

This code counts the total number of lines, but it takes a few minutes.

my $lines = do {
    open my $fh, '<', "filename" or die "Can't open filename: $!";
    1 while (<$fh>);
    $.
};
print "Total number of lines: $lines\n";

Answer 4

To count lines efficiently, just do the following:

#include <stdio.h>

int main(void)
{
   unsigned long lines = 0;
   int c; /* c must be an int, not char */

   while ((c = getchar()) != EOF)
      if (c == '\n')
         lines++;
   printf("%lu\n", lines);
   return 0;
} /* main */

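I think this is close to the example in Kernighan & Ritchie, if not the same. Please don't use a double for counting next time. Counting with integer types is much more efficient than counting with floating point.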

Fastest way to count lines in a file in MATLAB (is Perl faster than C?)
