Tokenizing using Boost gives the same output

Time: 2018-10-24 12:30:33

Tags: c++ boost tokenize

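I want to tokenize a lot of Burmese text, so I tried using the Boost tokenizer.

The text I tried is ျခင္းခတ္ခဲ့တာလို႕, which should be tokenized into pieces such as ျခင္း, but the program just prints the input. Am I doing something wrong?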

င္းျခင္း

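The program I ran:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main() {
    using namespace std;
    using namespace boost;
    string s = "ျခင္းခတ္ခဲ့တာလို႕";
    tokenizer<> tok(s);
    for (tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
}

The output should be split into a series of tokens, for example ျခင္း, but at present the output is identical to the input.

If possible, I would like to split this text into a series of tokens at word boundaries.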

1 answer:

Answer 0 (score: 1)

I don't understand the language, but in general, detecting word boundaries is not the same thing as tokenizing.

Instead, use Boost Locale's Boundary Analysis.

Example:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text = "To be or not to be, that is the question.";
    // Build a word-boundary segment index over the text for the en_US.UTF-8 locale.
    ssegment_index map(word, text.begin(), text.end(), gen("en_US.UTF-8"));
    // Print all "words" -- chunks of word boundary
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
        std::cout << "\"" << *it << "\", ";
    std::cout << std::endl;
}

This prints:

"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",

The sentence "生きるか死ぬか、それが問題だ。" would be split into the following segments in the ja_JP.UTF-8 (Japanese) locale:

"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。", 

Demo

A demo using the OP's text and the my_MM locale:

Live On Coliru

#include <boost/range/iterator_range.hpp>
#include <boost/locale.hpp>
#include <boost/locale/boundary.hpp>
#include <iostream>
#include <iomanip>

int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text="ျခင္းခတ္ခဲ့တာလို႕";

    ssegment_index map(word,text.begin(),text.end(),gen("my_MM.UTF-8")); 

    for (auto&& segment : boost::make_iterator_range(map.begin(), map.end()))
        std::cout << std::quoted(segment.str()) << std::endl;
}

Prints:

"ျ"
"ခ"
"င္း"
"ခ"
"တ္"
"ခဲ့"
"တာ"
"လို႕"

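This may or may not be what the OP expects. Note that you may have to generate/install the appropriate locale on your system for this to work as expected.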