Question

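I am using grep to count the occurrences of the string "^mj" in the file graph.tcl. The command I wrote is simple and should be easy to follow.

grep "^mj" mjwork/run/graph.tcl | wc -l

It prints 46625 after about 45 minutes. Can you suggest a better approach? Can the time be reduced?

Thanks!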

Answer 1

The following line may make it faster:

$ awk '/^mj/{c++}END{print c}' file

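This processes the file only once and prints only the total number of matches. By contrast, in your original command you asked grep to print every matching line into a buffer, which wc then had to process again.

Finally, you can also do this: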

$ grep -c '^mj' file

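It returns only the total number of matching lines. This may be even faster than the awk version: by default awk tries to split every line into fields, which the grep above does not need to do.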
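Before timing the alternatives on the real file, it can help to confirm on a small sample that all three commands report the same number. The following is only a sketch: the sample file and its contents are made up for illustration, and the variable names are not from the original post.

```shell
#!/bin/sh
# Build a small throwaway sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf 'mj one\nfoo mj\nmj two\nbar\nmjx\n' > "$sample"

# 1. Original approach: grep prints every matching line, wc counts them.
#    tr strips the padding some wc implementations add.
count_pipe=$(grep '^mj' "$sample" | wc -l | tr -d ' ')

# 2. awk counts in a single pass; c+0 prints 0 instead of an empty line
#    when nothing matches.
count_awk=$(awk '/^mj/{c++} END{print c+0}' "$sample")

# 3. grep -c prints only the number of matching lines.
count_grep=$(grep -c '^mj' "$sample")

echo "pipe=$count_pipe awk=$count_awk grep-c=$count_grep"
rm -f "$sample"
```

On the real input, replace "$sample" with mjwork/run/graph.tcl and prefix each command with time to compare them.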

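There are many possible reasons for your process being slow: heavy disk load, a slow NFS mount if you are using one, extremely long lines to parse, ... Without more information about the input file and the system you are running this on, it is hard to say why it is so slow.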
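A few of these suspects can be checked from the shell. The sketch below is one possible way to gather that information, not a definitive diagnosis; it runs on a made-up sample file, and the variable names are chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical sample; point "file" at the real input instead.
file=$(mktemp)
printf 'mj short\nthis is a much longer line\nmj\n' > "$file"

# How large is the file, and how many lines does it have?
bytes=$(wc -c < "$file" | tr -d ' ')
lines=$(wc -l < "$file" | tr -d ' ')

# Length of the longest line: very long lines can slow down line-oriented tools.
longest=$(awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$file")

echo "bytes=$bytes lines=$lines longest_line=$longest"

# Which filesystem does the file live on? An NFS mount typically shows up as host:/path.
df -h "$file"

rm -f "$file"
```

Timing a plain read of the file, for example with cat redirected to /dev/null, helps separate disk or network slowness from time spent in grep itself.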

Answer 2

It sounds like a problem with your machine. Do you have enough swap space, etc.? What does df -h show? As a test, try egrep or fgrep in place of grep.
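A sketch of that test, using the grep -E and grep -F spellings that current grep versions prefer over egrep and fgrep. The sample file is made up for illustration. Note that fixed-string matching treats '^' as a literal character, so the fgrep variant cannot anchor the match to the start of the line and may report a different count.

```shell
#!/bin/sh
# Throwaway sample file (hypothetical contents).
sample=$(mktemp)
printf 'mj one\nfoo mj\nmj two\nbar\n' > "$sample"

# grep -E (egrep): the anchored pattern works unchanged.
count_egrep=$(grep -E -c '^mj' "$sample")

# grep -F (fgrep): the pattern is a literal string, so there is no anchor;
# this counts lines containing "mj" anywhere, including "foo mj".
count_fgrep=$(grep -F -c 'mj' "$sample")

echo "grep -E: $count_egrep  grep -F: $count_fgrep"
rm -f "$sample"
```

If the fixed-string variant is dramatically faster on the real file, that points at the pattern matching rather than at disk or memory.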

Answer 3

You should try this small C program I just put together.

#define _FILE_OFFSET_BITS 64

#include <string.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

const char needle[] = "mj";

int main(int argc, char * argv[]) {
  int fd, res;
  long count;
  size_t i, size;
  const size_t nlen = sizeof needle - 1;  /* needle length without the trailing NUL */
  struct stat st;
  char * data;

  if (argc != 2) {
    fprintf(stderr, "Syntax: %s file\n", *argv);
    return 1;
  }

  fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    fprintf(stderr, "Couldn't open file \"%s\": %s\n", argv[1], strerror(errno));
    return 1;
  }

  res = fstat(fd, &st);
  if (res < 0) {
    fprintf(stderr, "Failed at fstat: %s\n", strerror(errno));
    return 1;
  }

  if (!S_ISREG(st.st_mode)) {
    fprintf(stderr, "File \"%s\" is not a regular file.\n", argv[1]);
    return 1;
  }

  size = (size_t) st.st_size;
  if (size == 0) {
    /* mmap rejects a zero-length mapping; an empty file has no matches */
    printf("0\n");
    return 0;
  }

  data = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
  if (data == MAP_FAILED) {  /* mmap signals failure with MAP_FAILED, not NULL */
    fprintf(stderr, "mmap failed!: %s\n", strerror(errno));
    return 1;
  }

  count = 0;
  /* i always points at the start of a line when the loop body begins */
  for (i = 0; i < size; i++) {
    /* look for the string at the start of the line: */
    if (size - i >= nlen && !memcmp(data + i, needle, nlen)) {
      count++;
      i += nlen;
    }
    /* skip to the end of the line; check the bound before reading data[i] */
    while (i < size && data[i] != '\n')
      i++;
  }

  printf("%ld\n", count);

  munmap(data, size);
  close(fd);
  return 0;
}

Compile it with: gcc grepmj.c -o grepmj -O2, then run it as: ./grepmj mjwork/run/graph.tcl

Why is grep taking so much time?
