Question

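As part of a larger program, I'm extracting individual sentences from a text file and storing them as strings in a vector of strings. I first decided to use the function below, which I have commented. However, after testing it, I realized it fails in two ways:

(1) It does not separate sentences when they are separated by a newline. (2) It does not separate sentences when they end with a quotation mark. (For example, the text Obama said, "Yes, we can." Then his audience gave a thunderous applause. will not be split.)

I need to fix these problems. However, I'm worried this will end up as spaghetti code, if it isn't already. Am I going about this the wrong way? I don't want to keep going back and patching things. Maybe there is some simpler way?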

// Extract sentences from Plain Text file 
std::vector<std::string> get_file_sntncs(std::fstream& file) { 
    // The sentences will be stored in a vector of strings, strvec:
    std::vector<std::string> strvec; 
    // Print out error if the file could not be found: 
    if(file.fail()) {
        std::cout << "Could not find the file. :( " << std::endl;
    // Otherwise, proceed to add the sentences to strvec. 
    } else { 
        char curchar;
        std::string cursentence;
    /* While we haven't reached the end of the file, add the current character to the 
       string representing the current sentence. If that current character is a period, 
       then we know we've reached the end of a sentence if the next character is a space or 
       if there is no next character; we then must add the current sentence to strvec. */
        while (file >> std::noskipws >> curchar) { 
           cursentence.push_back(curchar);
            if (curchar == '.') {
                if (file >> std::noskipws >> curchar) { 
                    if (curchar == ' ') {
                        strvec.push_back(cursentence);
                        cursentence.clear();
                    } else { 
                        cursentence.push_back(curchar);
                    }
                } else { 
                    strvec.push_back(cursentence);
                    cursentence.clear();
                }

            }

        }

    }
    return strvec; 
}

Answer 1

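Given that you want to detect sentence boundaries by punctuation, whitespace, and certain combinations of the two, a regular expression seems like a good solution. You can use a regular expression to describe the character sequences that indicate a sentence boundary, e.g.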

[.!?]\s+

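which means: "one of dot, exclamation mark or question mark, followed by one or more whitespace characters". Since a newline counts as whitespace, this also covers sentences separated by a newline.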
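To make that pattern concrete on its own, here is a minimal sketch that splits a string at every match of it. It is not part of the original answer: it uses std::regex from the C++11 standard library instead of Boost, it assumes a toolchain whose <regex> actually works (for GCC that means 4.9 or later), and the helper name split_on_boundaries is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): splits `text` at every
// match of the boundary pattern [.!?]\s+ using the standard <regex> library.
// The matched punctuation and whitespace are not part of the returned pieces.
std::vector<std::string> split_on_boundaries(const std::string& text) {
    static const std::regex boundary(R"([.!?]\s+)");
    // Submatch index -1 selects the text *between* matches.
    std::sregex_token_iterator it(text.begin(), text.end(), boundary, -1);
    std::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

Calling split_on_boundaries("One. Two!\nThree? Four") should give the four pieces One, Two, Three and Four; the newline after "Two!" is treated like any other whitespace.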

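A particularly convenient way to use regular expressions in C++ is the regular expression implementation included in the Boost library. Here is an example of how it would work in your case: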

#include <string>
#include <vector>
#include <iostream>
#include <iterator>
#include <boost/regex.hpp>

int main()
{
  /* Input. */
  std::string input = "Here is a short sentence. Here is another one. And we say \"this is the final one.\", which is another example.";

  /* Define sentence boundaries. */
  boost::regex re("(?: [\\.\\!\\?]\\s+" // case 1: punctuation followed by whitespace
                  "|   \\.\\\",?\\s+"   // case 2: end of quotation
                  "|   \\s+\\\")",      // case 3: start of quotation
           boost::regex::perl | boost::regex::mod_x);

  /* Iterate through sentences. */
  boost::sregex_token_iterator it(begin(input),end(input),re,-1);
  boost::sregex_token_iterator endit;

  /* Copy them onto a vector. */
  std::vector<std::string> vec;
  std::copy(it,endit,std::back_inserter(vec));

  /* Output the vector, so we can check. */
  std::copy(begin(vec),end(vec),
            std::ostream_iterator<std::string>(std::cout,"\n"));

  return 0;
}

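Note that I used the boost::regex::perl and boost::regex::mod_x options to construct the regex matcher. This allows extra whitespace inside the regular expression to make it more readable.

Also note that certain characters, such as . (dot), ! (exclamation mark) and others, need to be escaped (i.e. you need to put \\ in front of them), because they would otherwise be treated as metacharacters with a special meaning. The backslash is doubled because the pattern is written inside a C++ string literal.

When compiling/linking the code above, you need to link it with the boost-regex library. Using GCC, the command looks something like: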

g++ -W -Wall -std=c++11 -o test test.cpp -lboost_regex

(assuming your program is stored in a file called test.cpp).
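If linking against Boost is not an option, roughly the same idea can be sketched with the standard library alone and applied directly to a file stream, as in the question. This is only an illustration under my own assumptions: the function name get_stream_sentences is hypothetical, the boundary pattern is a simplified variant of the one above (terminating punctuation, an optional closing double quote, then whitespace), and std::regex has no equivalent of mod_x, so the pattern is written without extra spacing.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for get_file_sntncs (standard library only).
std::vector<std::string> get_stream_sentences(std::istream& in) {
    // Read the entire stream into one string.
    std::ostringstream buffer;
    buffer << in.rdbuf();
    const std::string text = buffer.str();

    // Boundary: '.', '!' or '?', an optional closing double quote, then
    // whitespace. \s also matches '\n', so newline-separated sentences split.
    static const std::regex boundary(R"([.!?]"?\s+)");

    std::vector<std::string> sentences;
    // Submatch index -1 yields the text between boundary matches, so the
    // terminating punctuation (and a closing quote, if any) is dropped.
    std::sregex_token_iterator it(text.begin(), text.end(), boundary, -1), end;
    for (; it != end; ++it) {
        if (it->length() > 0) sentences.push_back(it->str());  // skip empty pieces
    }
    return sentences;
}
```

Because std::fstream derives from std::istream, the stream from the question can be passed in unchanged. As with the Boost version, the boundary characters are not kept in the results, and a final sentence that is not followed by whitespace keeps its trailing period.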
