Efficient string-to-key matching in an unordered_map?

Time: 2014-04-06 23:56:31

Tags: c++ regex c++11 string-matching unordered-map

The most efficient way to map these strings to functions is a hash table:

std::string a="/foo/", b="/foo/car/", c="/foo/car/can/", d="/foo/car/haz/";
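
A minimal sketch of that exact-match case, for illustration only. The `Handler` alias and the functions `make_routes` and `dispatch` are hypothetical names that do not come from the question, and the handlers just return a label.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical handler type: takes nothing and returns a label for the demo.
using Handler = std::function<std::string()>;

// Build a table keyed by the four literal paths from the question.
inline std::unordered_map<std::string, Handler> make_routes()
{
    std::unordered_map<std::string, Handler> routes;
    routes["/foo/"]         = [] { return std::string("foo"); };
    routes["/foo/car/"]     = [] { return std::string("car"); };
    routes["/foo/car/can/"] = [] { return std::string("can"); };
    routes["/foo/car/haz/"] = [] { return std::string("haz"); };
    return routes;
}

// Average O(1) lookup; returns an empty string when the path is unknown.
inline std::string dispatch(const std::unordered_map<std::string, Handler>& routes,
                            const std::string& path)
{
    auto it = routes.find(path);
    return it == routes.end() ? std::string() : it->second();
}
```

This only works while every key is a literal string, because the whole path is hashed as one unit.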

Unfortunately, things get more complicated as soon as you want to match even the simplest patterns:

/foo/[a-Z|0-9]+/
/foo/[a-Z|0-9]+/bar/[a-Z|0-9]+/

I have been told that the <regex> library is overkill for my needs, and that it carries a lot of overhead.
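
For comparison, this is roughly what the <regex> route looks like for the first pattern. It is a sketch under one assumption: that [a-Z|0-9]+ means "one or more ASCII letters or digits", written here as [A-Za-z0-9]+. The function name is made up for the example.

```cpp
#include <regex>
#include <string>

// Returns true when path has the form "/foo/<letters-or-digits>/".
inline bool matches_foo_param(const std::string& path)
{
    // Constructing a std::regex parses the pattern, so build it once
    // rather than on every call.
    static const std::regex pattern("/foo/[A-Za-z0-9]+/");
    // regex_match requires the whole string to match, not just a prefix.
    return std::regex_match(path, pattern);
}
```

With many routes, each pattern would have to be tried in turn, which is the linear cost the rest of the thread tries to avoid.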

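Using a hash table (std::unordered_map) here may be an efficient option, with [a-Z|0-9]+ checked in a single parse inside a switch/case. Split on / to get the number of arguments, then use the number of / followed by the arguments themselves to decide which path to take: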

"/foo/"                  => {<function>, "/foo/can/", "/foo/[a-Z|0-9]+/bar/"}
"/foo/xflkjkjc34v"       => {<function>, "/foo/can/", "/foo/[a-Z|0-9]+/bar/"}
"/foo/can"               => {<function>, "/foo/can/", "/foo/[a-Z|0-9]+/bar/"}
"/foo/vxcvxc86vzxc/bar/" => {<function>, "/foo/[a-Z|0-9]+/bar/haz"}

It would be possible to implement, but is this the best approach?
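
The splitting step and the character-class check described above can be sketched without <regex>. This assumes, as a guess at the intent, that [a-Z|0-9]+ means a non-empty run of ASCII letters and digits; `split_path` and `is_alnum_segment` are hypothetical helper names.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split "/foo/abc/bar/" into {"foo", "abc", "bar"}, skipping empty pieces.
inline std::vector<std::string> split_path(const std::string& path)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start < path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos) end = path.size();
        if (end > start) parts.push_back(path.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// True when the segment is non-empty and consists only of letters and digits
// (as classified by std::isalnum in the current C locale).
inline bool is_alnum_segment(const std::string& s)
{
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isalnum(c)) return false;
    return true;
}
```

The number of segments returned by `split_path` gives the argument count, and each segment can then be tested either against a literal or with `is_alnum_segment`.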

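2 Answers:

Answer 0: (score: 0)

The ideal data structure would be a trie in which each slash-separated segment is first matched against the fixed (wildcard-free) strings at that node, held in an unordered_map or even a sorted vector (searchable in O(1) or O(log N) respectively). If no fixed string matches, the segment is matched against a vector of regular expressions, which you will probably have to try one at a time (O(N)). Depending on your performance needs, you could simplify things by treating the constant strings as regular expressions too and always doing an O(N) search at each node of the trie.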

+----------+     +---------------+                   +-----------+
| fixed:   |     | fixed:        |                   | fixed:    |
|    foo  -+---->|    bar       -|---> fn_foo_bar  --|   xxx    -|---> fn_foo_X_xxx
|    abc  -+-    |               |                /  |           |
| regexp:  | \   | regexp:       |               /   | regexp:   |
+----------+  |  |    [A-Z0-9]+ -|---------------    +-----------+
              |  +---------------+
              |
              \->+---------------+
                 | fixed:        |
                  ...

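If you have more specific insight into how many fixed and reg-exp components to expect, you can optimize this further, but it is a general solution with reasonable scalability.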
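
A minimal sketch of such a trie, under simplifying assumptions that are not part of the answer: the per-node vector of regexps is collapsed to a single wildcard child that accepts any alphanumeric segment, routes are registered as pre-split segments with "*" marking the wildcard, and the names `Node`, `Router`, and `Handler` are invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical handler: receives the captured wildcard segments.
using Handler = std::function<std::string(const std::vector<std::string>&)>;

struct Node {
    std::unordered_map<std::string, std::unique_ptr<Node>> fixed; // literal children
    std::unique_ptr<Node> wildcard; // stands in for the "regexp" slot
    Handler handler;                // set when a route ends at this node
};

class Router {
public:
    // Register a route given as segments; "*" marks an alphanumeric wildcard.
    void add(const std::vector<std::string>& pattern, Handler h)
    {
        assert(h && "a route needs a callable handler");
        Node* n = &root_;
        for (const std::string& seg : pattern) {
            std::unique_ptr<Node>& child =
                (seg == "*") ? n->wildcard : n->fixed[seg];
            if (!child) child.reset(new Node);
            n = child.get();
        }
        n->handler = std::move(h);
    }

    // Walk one segment at a time: literal lookup first, wildcard second.
    // There is no backtracking, so a literal match is never reconsidered.
    // Returns nullptr when no route matches; wildcard values go to params.
    const Handler* find(const std::vector<std::string>& segments,
                        std::vector<std::string>& params) const
    {
        const Node* n = &root_;
        for (const std::string& seg : segments) {
            auto it = n->fixed.find(seg);
            if (it != n->fixed.end()) {
                n = it->second.get();
            } else if (n->wildcard && is_alnum(seg)) {
                params.push_back(seg);
                n = n->wildcard.get();
            } else {
                return nullptr;
            }
        }
        return n->handler ? &n->handler : nullptr;
    }

private:
    static bool is_alnum(const std::string& s)
    {
        if (s.empty()) return false;
        for (unsigned char c : s)
            if (!std::isalnum(c)) return false;
        return true;
    }

    Node root_;
};
```

Each step costs one average O(1) hash lookup plus at most one character scan, so a lookup is linear in the number of segments. Supporting several distinct patterns per node would mean replacing the single wildcard child with a list that is tried in order, as the answer describes.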

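Answer 1: (score: 0)

Following up on my comment, here is what I think is a simple and reasonably efficient solution. It is pseudo-code, because I don't know the specifics of your problem (e.g. the type of the functions you want to map).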

#include <cstdio>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

#define MAX_SEGMENTS 255
#define LABEL_LENGTH 10
#define KEY_LENGTH (MAX_SEGMENTS*LABEL_LENGTH)
#define LABEL_FORMAT "%10u"

// ------------------------------------------------------------------------

/**
 * Simple segment defined by position and length in a string.
 */
struct Segment
{
    unsigned pos;
    unsigned len;
};

/**
 * Example of container for regexps. 
 * This could be a tree if you had a nested structure among your regexps.
 * MyRegexp is an object that defines match( const char* segment, unsigned len )
 */
std::vector<MyRegexp> regexps;

/**
 * Mapped functions are in an unordered_map indexed by keys typically built in 
 * parse_segments below.
 */
std::unordered_map<std::string,Function*> mapped_fun;

// ------------------------------------------------------------------------

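void split_address( const std::string& address, std::vector<Segment>& segments )
{
    // Split address into the non-empty segments separated by '/'
    segments.clear();
    std::string::size_type start = 0;
    while ( start < address.size() )
    {
        std::string::size_type end = address.find( '/', start );
        if ( end == std::string::npos ) end = address.size();
        if ( end > start )
            segments.push_back( { static_cast<unsigned>(start), static_cast<unsigned>(end - start) } );
        start = end + 1;
    }
}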

void parse_segments( const std::string& address, const std::vector<Segment>& segments, char *key )
{
    // key should be of length MAX_SEGMENTS*LABEL_LENGTH.

    // Loop over all regular expressions for each segment.
    // If some regular expressions match a subset of others, then 
    // you have a tree structure among your regexps and you can 
    // exploit this structure to match your segments faster.

    // Here is an example of pseudo-code to create your key, assuming 
    // that you have a vector of regexps. For each segment, the index of 
    // the first matching regexp is written as a fixed-width label into 
    // that segment's slot of the key. Segments beyond MAX_SEGMENTS are 
    // ignored so that the key buffer is never overrun.
    char buf[ LABEL_LENGTH+1 ];
    for ( unsigned s = 0; s < segments.size() && s < MAX_SEGMENTS; ++s )
        for ( unsigned r = 0; r < regexps.size(); ++r )
            if ( regexps[r].match( &address[segments[s].pos], segments[s].len ) )
            {
                std::snprintf( buf, sizeof buf, LABEL_FORMAT, r );
                std::memcpy( key+LABEL_LENGTH*s, buf, LABEL_LENGTH );
                break;
            }
}

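Function* map_address( const std::string& address )
{
    // Split address into segments
    std::vector<Segment> segments;
    split_address( address, segments );

    // Match segments to regexps. The key is reset to blanks on every call 
    // so that slots left unwritten do not keep labels from a previous address.
    // Note that the static buffer makes this function non-reentrant.
    static std::string key; key.assign( KEY_LENGTH, ' ' );
    parse_segments( address, segments, &key[0] );

    // Map address to function with a single lookup
    auto it = mapped_fun.find( key );
    return it == mapped_fun.end() ? nullptr : it->second;
}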
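
The answer leaves MyRegexp undefined apart from its match( const char* segment, unsigned len ) interface. One possible stand-in, sketched here as an assumption rather than as the answer's intended class, supports only two kinds of pattern: an exact literal and "one or more alphanumeric characters". The factory names `literal` and `alnum` are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical stand-in for MyRegexp with the interface the answer requires.
class MyRegexp {
public:
    // Pattern that matches exactly the given literal segment.
    static MyRegexp literal(const std::string& text) { return MyRegexp(text, false); }
    // Pattern that matches any non-empty run of letters and digits.
    static MyRegexp alnum() { return MyRegexp(std::string(), true); }

    // segment points into the address and is NOT null-terminated at len.
    bool match(const char* segment, unsigned len) const
    {
        assert(segment != nullptr || len == 0);
        if (!alnum_)
            return len == text_.size() &&
                   std::memcmp(segment, text_.data(), len) == 0;
        if (len == 0) return false;
        for (unsigned i = 0; i < len; ++i)
            if (!std::isalnum(static_cast<unsigned char>(segment[i])))
                return false;
        return true;
    }

private:
    MyRegexp(const std::string& text, bool alnum) : text_(text), alnum_(alnum) {}

    std::string text_; // literal to compare against (unused for alnum patterns)
    bool alnum_;       // true for the alphanumeric wildcard
};
```

Because a literal such as foo also satisfies the alphanumeric pattern, the order in which the regexps vector is searched decides which label a segment receives, so literals would normally be placed before the wildcard.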