Question

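I'm trying to tokenize the lines of a file with _tcstok. I can tokenize a line once, but when I try to tokenize it a second time I get an access violation. I suspect the problem is not in reading the values themselves but in the location being accessed. I can't work out what is going wrong.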

Thanks,

Dave

P.S. I'm using TCHAR and _tcstok because the file is UTF-8.

This is the error I get:

First-chance exception at 0x63e866b4 (msvcr90d.dll) in Testing.exe: 0xC0000005: Access violation reading location 0x0000006c.

vector<TCHAR> TabDelimitedSource::getNext() {
// Returns the next document (a given cell) from the file(s)
TCHAR row[256]; // Return NULL if no more documents/rows
vector<TCHAR> document;

try{
    //Read each line in the file, corresponding to an individual document
    buff_reader->getline(row,10000);
    }
catch (ifstream::failure e){
        ; // Ignore and fall through
    }

if (_tcslen(row)>0){
    this->current_row += 1;
    vector<TCHAR> cells;
      //Separate the line on tabs (id 'tab' document title 'tab' document body)
     TCHAR *  pch;
     pch = _tcstok(row,"\t");
     while (pch != NULL){
         cells.push_back(*pch);
         pch = _tcstok(NULL, "\t");
     }

    // Split the cell into individual words using the lucene analyzer
    try{
      //Separate the body by spaces
        TCHAR original_document ;
        original_document = (cells[column_holding_doc]);
        try{
            TCHAR * pc;
            pc = _tcstok((char*)original_document," ");
             while (pch != NULL){
                 document.push_back(*pc);
                pc = _tcstok(NULL, "\t");
             }

Answer 1

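First off, your code is a mixture of C string manipulation and C++ containers. That will only dig you into a hole. Ideally, you should tokenize the line into a std::vector<std::wstring>.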
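As a rough illustration of that suggestion, here is a minimal sketch that splits a line into a std::vector<std::wstring> without any C string functions. The helper name `split` is hypothetical, not something from the question or from a library. Note that, unlike the strtok family, this version keeps empty fields instead of collapsing consecutive delimiters.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `line` on `delim`, keeping empty fields.
std::vector<std::wstring> split(const std::wstring& line, wchar_t delim) {
    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    for (;;) {
        const std::wstring::size_type end = line.find(delim, start);
        if (end == std::wstring::npos) {
            // No more delimiters: the rest of the line is the last field.
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;  // continue just past the delimiter
    }
    return fields;
}
```

The same function could then be reused for both steps in the question: `split(row, L'\t')` for the cells, then `split(cells[column_holding_doc], L' ')` for the words of the document body.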

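Also, you seem quite confused about TCHAR and UTF-8. TCHAR is a character type that "floats" between 8 and 16 bits depending on compile-time flags. A UTF-8 file uses between one and four bytes to represent each character. So you probably want to hold the text as std::wstring objects, but you will need to convert the UTF-8 to wstrings explicitly.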
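To make the explicit conversion concrete, here is a hand-rolled sketch of a UTF-8 to std::wstring decoder. The function name `utf8_to_wstring` is hypothetical, and the validation is deliberately minimal (it does not reject overlong encodings, for instance). On Windows the usual route would be the Win32 function MultiByteToWideChar with CP_UTF8; the portable version below is only meant to show what the conversion involves.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into a std::wstring.
// Emits UTF-16 surrogate pairs when wchar_t is 16 bits wide (as on Windows).
std::wstring utf8_to_wstring(const std::string& utf8) {
    std::wstring out;
    std::string::size_type i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = 0;  // decoded code point
        unsigned extra = 0;    // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (utf8.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (unsigned k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

Because tab and space are single-byte ASCII characters in UTF-8, splitting can be done either before or after this conversion.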

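But if you just want to get something working, focus on your tokenization. You need to store the address of the start of each token (as a TCHAR*), but your vector is a vector of TCHARs. When you later try to use the token data, you cast a TCHAR to a TCHAR* pointer, with the unsurprising result of an access violation. The AV address you give is 0x0000006c, which is the ASCII code for the character 'l'.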

  vector<TCHAR*> cells;
  ...
  cells.push_back(pch);

...and then...

    TCHAR *original_document = cells[column_holding_doc];
    TCHAR *pc = _tcstok(original_document," ");
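Putting the two fragments above together, a complete version might look like the sketch below. It assumes a build in which TCHAR is plain char, where _tcstok maps to strtok, so it uses std::strtok directly to stay portable. The function name `words_in_column` and its parameters are hypothetical, introduced only for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenize `row` on tabs, then split the chosen cell
// on spaces. `row` is modified in place, as with any strtok-style call.
std::vector<std::string> words_in_column(char* row, std::size_t column) {
    // Store a pointer to the start of each token, not a single character.
    std::vector<char*> cells;
    for (char* pch = std::strtok(row, "\t"); pch != NULL;
         pch = std::strtok(NULL, "\t")) {
        cells.push_back(pch);
    }

    std::vector<std::string> words;
    if (column >= cells.size()) {
        return words;  // the requested column does not exist in this row
    }

    // The first pass is finished, so starting a second strtok pass is safe.
    for (char* pc = std::strtok(cells[column], " "); pc != NULL;
         pc = std::strtok(NULL, " ")) {
        words.push_back(pc);  // copies the token into a std::string
    }
    return words;
}
```

Copying each word into a std::string means the result stays valid after the row buffer goes out of scope, which a vector of raw pointers into a local array would not.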
