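Tokenizing a string in C++

Time: 2013-09-30 14:33:33

Tags: c++ string tokenize

I am using the following code to split each line of a file into word tokens. My problem is that the token index restarts at 0 on every line; I want it to keep counting up across the whole file. The contents of the file are: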

Student details:
Highlander 141A Section-A.
Single 450988012 SA

Program:

#include <iostream>
using std::cout;
using std::endl;

#include <fstream>
using std::ifstream;

#include <cstring>

const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 20;
const char* const DELIMITER = " ";

int main()
{
  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found

  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
      }
    }

    // process (print) the tokens
    for (int i = 0; i < n; i++) // n = #of tokens
      cout << "Token[" << i << "] = " << token[i] << endl;
    cout << endl; // blank line after each line's tokens (outside the for loop)
  }
}

Output:

Token[0] = Student
Token[1] = details:

Token[0] = Highlander
Token[1] = 141A
Token[2] = Section-A.

Token[0] = Single
Token[1] = 450988012
Token[2] = SA

Expected:

Token[0] = Student
Token[1] = details:

Token[2] = Highlander
Token[3] = 141A
Token[4] = Section-A.

Token[5] = Single
Token[6] = 450988012
Token[7] = SA

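So I want the index to keep increasing, so that I can easily identify each value by its variable name. Thanks in advance...

2 answers:

Answer 0: (score: 2)

What's wrong with the standard, idiomatic solution: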

std::string line;
while ( std::getline( fin, line ) ) {
    std::istringstream parser( line );
    int i = 0;
    std::string token;
    while ( parser >> token ) {
        std::cout << "Token[" << i << "] = " << token << std::endl;
        ++ i;
    }
}

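Obviously, in real life you will want to do more than just output each token, and you will want more sophisticated parsing. But whenever you are doing line-oriented input, the above is the model you should use (probably also keeping track of the line number, for error messages).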
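
In the snippet above, the counter `i` is declared inside the line loop, so it restarts at 0 for every line. A minimal sketch of the same model with a running index might look like the following. The function name `print_numbered_tokens` and its stream parameters are assumptions made for illustration and do not come from the original answer.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original answer): prints every
// whitespace-separated token read from `in`, numbering the tokens across
// the whole input instead of per line. Returns the total token count.
int print_numbered_tokens(std::istream& in, std::ostream& out)
{
    int index = 0;  // declared outside the line loop, so it never resets
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream parser(line);
        std::string token;
        while (parser >> token) {
            out << "Token[" << index << "] = " << token << '\n';
            ++index;
        }
        out << '\n';  // blank line after each input line, as in the question
    }
    return index;
}
```

With the question's `data.txt` opened as `fin`, calling `print_numbered_tokens(fin, std::cout)` should number the tokens from 0 to 7 while still separating the lines with a blank line.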

It's worth pointing out that in this case, an even better solution would be to use boost::split in the outer loop, to get a vector of tokens.
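
If Boost is not available, a similar per-line vector of tokens can be built with the standard library alone. The sketch below is that substitute, not `boost::split` itself; the helper name `split_line` is hypothetical, and it splits on any whitespace rather than on a configurable delimiter set.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line on whitespace into a vector of
// tokens. Standard library only; this is a stand-in for boost::split,
// not a reimplementation of it.
std::vector<std::string> split_line(const std::string& line)
{
    std::istringstream parser(line);
    std::istream_iterator<std::string> first(parser);  // reads tokens with >>
    std::istream_iterator<std::string> last;           // end-of-stream marker
    return std::vector<std::string>(first, last);
}
```

Inside the outer `std::getline` loop, each line could then be passed to `split_line`, and a counter kept outside that loop would give the continuous numbering the question asks for.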

Answer 1: (score: 0)

I would just let the iostream do the splitting:

std::vector<std::string> token;
std::string s;
while (fin >> s)
    token.push_back(s);

Then you can output the whole array at once, with the appropriate indexes.

for (std::size_t i = 0; i < token.size(); ++i) // size_t avoids a signed/unsigned comparison
    cout << "Token[" << i << "] = " << token[i] << endl;

Update

You can even omit the vector entirely and output the tokens as you read them from the input stream:

std::string s;
for (int i = 0; fin >> s; ++i)
    std::cout << "Token[" << i << "] = " << s << std::endl; // print s; there is no vector here