Question

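I want to count how many unique words are in the string 's', where punctuation and newlines (\n) separate the words. So far I have used the logical OR operator to check how many word separators are in the string, and I add 1 to the result to get the number of words in s.

My current code returns 12 as the number of words. Since 'ab', 'AB', 'aB', 'Ab' (and likewise the 'zzzz' variants) are all the same word and not unique, how can I ignore case variants of a word? I followed this link: http://www.cplusplus.com/reference/algorithm/unique/, but that reference counts the unique items in a vector, and I am using a string, not a vector.

Here is my code: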

#include <iostream>
#include <string>
using namespace std;

bool isWordSeparator(char & c) {

    return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
    || c == '?' || c == '!' || c == ':' || c == ';';
}

int countWords(string s) {
    int wordCount = 0;

    if (s.empty()) {
        return 0;
    }

    for (int x = 0; x < s.length(); x++) {
        if (isWordSeparator(s.at(x))) {
            wordCount++;
        }
    }

    return wordCount+1;
}

int main() {
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
    int number_of_words = countWords(s);

    cout << "Number of Words: " << number_of_words  << endl;

    return 0;

}

Answer 1

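You can create a set of strings, keep track of the position of the last separator (starting at 0), use substring to extract each word, and then insert it into the set. When you are done, just return the size of the set.

You could also use string::split to make the whole operation easier: it tokenizes the string for you. All you have to do is insert every element of the returned array into the set and then, again, return its size.

Edit: per the comments, you need a custom comparator to ignore case when comparing.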
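
The approach above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the names CaseInsensitiveLess and countUniqueWords are made up here, and the choice of std::isspace/std::ispunct as the separator test is an assumption. Standard C++ std::string has no split member, so the sketch scans for separators by hand and uses substr.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical comparator (not from the answer): orders strings ignoring case,
// so "ab" and "AB" count as the same key in the set.
struct CaseInsensitiveLess {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Hypothetical helper: counts distinct words, assuming whitespace and
// punctuation are the separators.
std::size_t countUniqueWords(const std::string& s) {
    std::set<std::string, CaseInsensitiveLess> words;
    std::size_t start = 0;  // position just past the last separator seen

    for (std::size_t i = 0; i <= s.size(); ++i) {
        const bool atEnd = (i == s.size());
        if (atEnd || std::isspace(static_cast<unsigned char>(s[i])) ||
            std::ispunct(static_cast<unsigned char>(s[i]))) {
            if (i > start) {  // skip empty pieces between adjacent separators
                words.insert(s.substr(start, i - start));
            }
            start = i + 1;
        }
    }
    return words.size();
}
```

Called on the string from the question, countUniqueWords should return 2 (one entry for the 'ab' variants, one for the 'zzzz' variants).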

Answer 2

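You need to make your code case-insensitive. You can use tolower() and apply it to the original string with std::transform:

std::transform(s.begin(), s.end(), s.begin(), ::tolower);

But I should add that your current code is closer to C than to C++; perhaps you should look at what the standard library offers.

I suggest std::transform(s.begin(), s.end(), s.begin(), ::tolower); plus istringstream for tokenizing, and istream_iterator or unique_copy to remove duplicates, like this: https://ideone.com/nb4BEH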
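
For readers who cannot open the link, here is one way the suggestion might look. It is a sketch under assumptions, not a copy of the linked code: countUniqueWordsSorted is a made-up name, a lambda is used in place of ::tolower so that separators can be turned into spaces in the same pass, and the words are sorted first because std::unique_copy only drops adjacent duplicates.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lowercase, tokenize with a stream, sort, then unique_copy.
std::size_t countUniqueWordsSorted(std::string s) {
    // Lowercase letters and turn separators into spaces so operator>> can tokenize.
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) -> char {
        if (std::isspace(c) || std::ispunct(c)) {
            return ' ';
        }
        return static_cast<char>(std::tolower(c));
    });

    // Read whitespace-separated tokens from the string.
    std::istringstream in(s);
    std::istream_iterator<std::string> first(in), last;
    std::vector<std::string> words(first, last);

    // unique_copy removes only consecutive duplicates, so sort first.
    std::sort(words.begin(), words.end());

    std::vector<std::string> unique;
    std::unique_copy(words.begin(), words.end(), std::back_inserter(unique));
    return unique.size();
}
```

Sorting costs O(n log n); inserting into a std::set or std::unordered_set instead would avoid the separate sort step.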

Answer 3

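While splitting the string into words, insert all of the words into a std::set. This gets rid of the duplicates. Then you only need to call set::size() to get the number of unique words.

I use the boost::split() function from the boost string algorithm library in my solution, because it is almost standard nowadays. Explanations are in the comments in the code...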

#include <iostream>
#include <string>
#include <set>
#include <cctype>
#include <boost/algorithm/string.hpp>
using namespace std;

// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
    return std::isspace(c) || std::ispunct(c);
}

// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split(). 
struct IgnoreCaseCompare { 
    bool operator()( const std::string& a, const std::string& b ) const {
        return boost::ilexicographical_compare( a, b );
    }
};

int main()
{
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";

    // Define a set that will contain only unique strings, ignoring case.
    set< string, IgnoreCaseCompare > words;

    // Split the string by using your isWordSeparator function
    // to define the delimiters. token_compress_on collapses multiple
    // consecutive delimiters into only one. 
    boost::split( words, s, isWordSeparator, boost::token_compress_on );

    // Now the set contains only the unique words.
    cout << "Number of Words: " << words.size() << endl;
    for( auto& w : words )
        cout << w << endl;

    return 0;
}

Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8

Answer 4

First of all, I would suggest rewriting isWordSeparator like this:

bool isWordSeparator(char c) {
    return std::isspace(c) || std::ispunct(c);
}

because your current implementation does not handle all punctuation and whitespace, for example \t or +.

Also, incrementing wordCount whenever isWordSeparator is true is incorrect, for example if you have something like ?! in the string.
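
To make the overcounting concrete, here is a small sketch. The names isSep, countBySeparators, and countWordStarts are invented for this illustration, and it counts all words rather than unique ones. The first function mirrors the question's "separators + 1" idea; the second counts positions where a word begins, which is not affected by runs of separators.

```cpp
#include <cctype>
#include <string>

// Hypothetical separator test, assuming whitespace and punctuation separate words.
bool isSep(unsigned char c) {
    return std::isspace(c) || std::ispunct(c);
}

// Mirrors the question's approach: number of separators plus one.
int countBySeparators(const std::string& s) {
    if (s.empty()) {
        return 0;
    }
    int n = 0;
    for (unsigned char c : s) {
        if (isSep(c)) {
            ++n;
        }
    }
    return n + 1;
}

// Counts word starts: a non-separator that follows a separator or the string start.
int countWordStarts(const std::string& s) {
    int n = 0;
    bool inWord = false;
    for (unsigned char c : s) {
        const bool sep = isSep(c);
        if (!sep && !inWord) {
            ++n;
        }
        inWord = !sep;
    }
    return n;
}
```

For "a?!b", countBySeparators gives 3 while countWordStarts gives 2; a trailing separator, as in "hi!", causes the same kind of off-by-one.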

So a less error-prone approach is to replace all separators with spaces and then iterate over the words, inserting them into an (unordered) set:

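#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_set>

int countWords(std::string s) {
    // Turn every separator into a space and lowercase everything else.
    // The explicit return type keeps both return statements consistent.
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) -> char {
        if (isWordSeparator(c)) {
            return ' ';
        }

        return static_cast<char>(std::tolower(c));
    });

    std::unordered_set<std::string> uniqWords;

    // operator>> skips whitespace, so runs of separators yield no empty words.
    std::stringstream ss(s);
    std::copy(std::istream_iterator<std::string>(ss), std::istream_iterator<std::string>(),
              std::inserter(uniqWords, uniqWords.end()));

    return static_cast<int>(uniqWords.size());
}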

Answer 5

You could consider the SQLite C++ wrapper.
