Question

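I have a file containing text (ASCII + Unicode), and I am trying to count the total number of words in it with a C++ program. I need to read the file line by line (using getline) and then process each line to count the words in it.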

So I wrote the following simple program:

#include <cstdint>   // uint64_t
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

int main(int argc, char* argv[]) {
  uint64_t ct = 0;
  std::string line;
  std::ifstream infile(argv[1]);
  while(std::getline(infile, line)) {
    std::stringstream inputStream(line);
    std::string token;
    while (inputStream >> token) {
      ++ct;
    }
  }

  std::cout << ct << std::endl;

  return 0;
}

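However, the program above prints a number smaller than the one reported by wc -w. To narrow down the problem, I modified the program to simply print what it reads. The program now looks like this: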

int main(int argc, char* argv[]) {
  uint64_t ct = 0;
  std::string line;
  std::ifstream infile(argv[1]);
  while(std::getline(infile, line)) {
    std::stringstream inputStream(line);
    std::string token;
    while (inputStream >> token) {
      std::cout << token << " ";
    }
    std::cout << std::endl;
  }

  return 0;
}

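I redirected the output of this program to another file. When I run wc -w on this new file, the number matches the one wc -w gives for the original file. This means my program is reading all of the words (that is, "words" as wc defines them). The plausible explanation is therefore that one of the token values read with inputStream >> token contains some Unicode character that wc interprets as whitespace. So how can I change my program so that it recognizes Unicode whitespace characters?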
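The hypothesis can be checked in isolation. The following is a minimal sketch, not part of the original program: `countTokens` is a hypothetical helper that tokenizes a string the same way the inner loop above does, and the input is assumed to be UTF-8 with the default "C" locale in effect.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts tokens the way the program above does: operator>> splits on
// whatever the stream's locale classifies as whitespace, one char at a time.
std::size_t countTokens(const std::string& line) {
  std::stringstream in(line);
  std::string token;
  std::size_t n = 0;
  while (in >> token) {
    ++n;
  }
  return n;
}
```

With the default locale, `countTokens("a b")` yields 2, whereas `countTokens("a\xE2\x80\x83" "b")` (the two letters separated by U+2003 EM SPACE in UTF-8) yields 1, because none of the three bytes of the encoded space is an ASCII whitespace character.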

Answer 1

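You can go by either of these:

A. Java's definition of Unicode whitespace, which excludes the non-breaking spaces.

or

B. Wikipedia's list, in which all 25 Unicode code points are defined as whitespace.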
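As an illustration of option B, here is a minimal sketch under stated assumptions: the file is UTF-8, and the names `isUnicodeSpace`, `nextCodePoint`, and `countWords` are hypothetical helpers introduced here, not anything from the question or from a library. The decoder is deliberately simple: it maps malformed bytes to U+FFFD and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True for the 25 code points carrying the Unicode White_Space property.
bool isUnicodeSpace(char32_t cp) {
  return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
         cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
         cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
         cp == 0x3000;
}

// Decodes the UTF-8 sequence starting at s[i] and advances i past it.
// Assumption: malformed input is reported as U+FFFD rather than rejected.
char32_t nextCodePoint(const std::string& s, std::size_t& i) {
  const unsigned char b0 = static_cast<unsigned char>(s[i++]);
  if (b0 < 0x80) return b0;  // plain ASCII
  int extra = 0;
  char32_t cp = 0;
  if (b0 >= 0xF0)      { extra = 3; cp = b0 & 0x07; }
  else if (b0 >= 0xE0) { extra = 2; cp = b0 & 0x0F; }
  else if (b0 >= 0xC0) { extra = 1; cp = b0 & 0x1F; }
  else return 0xFFFD;        // stray continuation byte
  for (; extra > 0; --extra) {
    if (i >= s.size()) return 0xFFFD;  // truncated sequence
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
    ++i;
  }
  return cp;
}

// Counts maximal runs of non-whitespace code points in one line.
std::uint64_t countWords(const std::string& line) {
  std::uint64_t count = 0;
  bool inWord = false;
  for (std::size_t i = 0; i < line.size();) {
    if (isUnicodeSpace(nextCodePoint(line, i))) {
      inWord = false;
    } else if (!inWord) {
      inWord = true;
      ++count;
    }
  }
  return count;
}
```

In the question's program, the inner `while (inputStream >> token)` loop could then be replaced by `ct += countWords(line);`. To follow option A instead, the no-break spaces U+00A0, U+2007, and U+202F would be dropped from `isUnicodeSpace`; check Java's documentation for the remaining differences before relying on that variant.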
