Question

所以我有以下数据字符串，它是通过TCP winsock连接接收的，并且想要进行高级标记化，进入结构体向量，其中每个结构代表一条记录。

std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n"

struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::additional;
};

字符串中的每条记录都由回车符分隔。我试图拆分记录，但尚未拆分字段：

    void tokenize(std::string& str, std::vector< string >records)
{
    // Skip delimiters at beginning.
    std::string::size_type lastPos = str.find_first_not_of("\n", 0);
    // Find first "non-delimiter".
    std::string::size_type pos     = str.find_first_of("\n", lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        records.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of("\n", pos);
        // Find next "non-delimiter"
        pos = str.find_first_of("\n", lastPos);
    }
}

似乎完全没必要再次重复所有代码以通过冒号（内部字段分隔符）将每个记录进一步标记化为结构并将每个结构推送到向量中。我确信有更好的方法可以做到这一点，或者设计本身就错了。

感谢您的帮助。

Answer 1

我的解决方案：

struct colon_separated_only: std::ctype<char> 
{
    colon_separated_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        typedef std::ctype<char> cctype;
        static const cctype::mask *const_rc= cctype::classic_table();

        static cctype::mask rc[cctype::table_size];
        std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));

        rc[':'] = std::ctype_base::space; 
        return &rc[0];
    }
};

struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};

int main() {
        std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
        stringstream s(buf);
        s.imbue(std::locale(std::locale(), new colon_separated_only()));
        table_t t;
        std::vector<table_t> data;
        while ( s >> t.key >> t.first >> t.last >> t.rank >> t.additional )
        {
           data.push_back(t);
        }
        for(size_t i = 0 ; i < data.size() ; ++i )
        {
           cout << data[i].key <<" ";
           cout << data[i].first <<" "<<data[i].last <<" ";
           cout << data[i].rank <<" "<< data[i].additional << endl;
        }
        return 0;
}

输出：

44 william adama commander stuff
33 luara roslin president data

在线演示：http://ideone.com/JwZuk

我在这里使用的技术在我对另一个问题的另一个解决方案中进行了描述：

Elegant ways to count the frequency of words in a file

Answer 2

为了将字符串分成记录，我只使用istringstream 因为这会在我想要阅读的时候简化更改一份文件。对于标记化，最明显的解决方案是boost :: regex，所以：

std::vector<table_t> parse( std::istream& input )
{
    std::vector<table_t> retval;
    std::string line;
    while ( std::getline( input, line ) ) {
        static boost::regex const pattern(
            "\([^:]*\):\([^:]*\):\([^:]*\):\([^:]*\):\([^:]*\)" );
        boost::smatch matched;
        if ( !regex_match( line, matched, pattern ) ) {
            //  Error handling...
        } else {
            retval.push_back(
                table_t( matched[1], matched[2], matched[3],
                         matched[4], matched[5] ) );
        }
    }
    return retval;
}

（我假设table_t的逻辑构造函数。另外：有一个非常的在C中以_t结尾的名字的长期传统是typedef，所以你是最好找一些其他惯例。）

将一串数据标记为结构向量？

2 个答案: