c++: Specifying the delimiters used when reading words from a text file

Time: 2015-12-21 18:58:20

Tags: c++

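I have the following code, which prints every unique word in a text file (containing >= 30k words) together with its count. However, it splits words on whitespace only, so I get results like this: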

[image: screenshot of the program output]

How can I modify the code to specify the delimiters I expect?

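#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

template <class KTy, class Ty>
void PrintMap(const map<KTy, Ty>& words)
{
    // "typename" is required because const_iterator is a dependent type.
    typedef typename map<KTy, Ty>::const_iterator iterator;
    for (iterator p = words.begin(); p != words.end(); ++p)
        cout << p->first << ": " << p->second << endl;
}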

void UniqueWords(const string& fileName) {
    // Will store the word and count.
    map<string, unsigned int> wordsCount;

    // Begin reading from file:
    ifstream fileStream(fileName);

    // Check if we've opened the file (as we should have).
    if (fileStream.is_open())
    {
        // Loop on the extraction itself: testing good() before reading can
        // count one empty "word" when the file ends with whitespace.
        string word;
        while (fileStream >> word)
        {
            // operator[] creates a missing entry with a count of 0,
            // so a single increment handles new and known words alike.
            ++wordsCount[word];
        }
    }
    else  // We couldn't open the file. Report the error in the error stream.
    {
        cerr << "Couldn't open the file." << endl;
    }

    // Print the words map.
    PrintMap(wordsCount);
}

3 answers:

Answer 0 (score: 2)

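You can use a stream that has been imbue()d with a std::ctype<char> facet which treats whatever characters you want as whitespace. Doing so would look something like this: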

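#include <iostream>
#include <locale>
#include <string>

// Holds the classification table. It is a base class so that it is
// constructed before std::ctype<char> receives a pointer to it.
struct myctype_table {
    std::ctype_base::mask table[std::ctype<char>::table_size];
    myctype_table(char const* spaces)
        : table() {  // zero the table: no character is classified yet
        // Mark each listed character as whitespace, advancing the pointer.
        for (; *spaces; ++spaces) {
            table[static_cast<unsigned char>(*spaces)] = std::ctype_base::space;
        }
    }
};
class myctype
    : private myctype_table
    , public std::ctype<char> {
public:
    // Qualify the name: std::ctype<char> has a member function called table().
    myctype(char const* spaces)
        : myctype_table(spaces)
        , std::ctype<char>(myctype_table::table) {
    }
};

int main() {
     // The locale takes ownership of the facet.
     std::locale myloc(std::locale(), new myctype(" \t\n\r?:.,!"));
     std::cin.imbue(myloc);
     for (std::string word; std::cin >> word; ) {
         // words are separated by the extended list of spaces
     }
}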

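This code is untested for now; I'm typing it on a mobile device. I may have misused some of the std::ctype<char> interface, but after fixing names and the like, something along these lines should work.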

Answer 1 (score: 1)

Since you would expect to find the forbidden character at the end of a word, you can remove it before putting the word into wordsCount:

if(word[word.length()-1] == ';' || word[word.length()-1] == ',' || ....){
   word.erase(word.length()-1);
}
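
A more general version of the same idea could strip a whole run of trailing punctuation in one call. This is only a sketch: the helper name strip_trailing and its default character set are assumptions, not part of the answer.

```cpp
#include <string>

// Hypothetical helper: removes any run of trailing characters that
// appear in `unwanted` (the default set is only an example).
std::string strip_trailing(std::string word,
                           const std::string& unwanted = ";,.!?:") {
    // Position of the last character that is NOT in `unwanted`.
    const std::string::size_type last = word.find_last_not_of(unwanted);
    if (last == std::string::npos)
        word.clear();          // empty, or nothing but unwanted characters
    else
        word.erase(last + 1);  // drop everything after the last kept character
    return word;
}
```

Unlike the single-character check, this also handles words such as "wait..." and returns an empty string when a token consists only of punctuation, so the caller may want to skip empty results before counting.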

Answer 2 (score: 0)

After fileStream >> word; you can call this function. See if it's clear:

// Returns a copy of word with every character listed in forbidden removed.
string adapt(const string& word) {
    const string forbidden = "!?,.[];";
    string ret;
    for (string::size_type i = 0; i < word.size(); i++) {
        // find() returns npos when word[i] is not a forbidden character.
        if (forbidden.find(word[i]) == string::npos)
            ret.push_back(word[i]);
    }
    return ret;
}

Something like this:

fileStream >> word;
word = adapt(word);