Boost :: Spirit解析器:寻找最大性能和最小内存使用量

时间:2018-03-30 09:20:19

标签: boost boost-spirit

关于解析复杂日志的几个问题后,我终于被告知这样做的最佳方法。

现在的问题是,是否有某种方法可以提高性能和/或减少内存使用量,甚至是编译时间。我会要求答案满足这些限制:

  1. MS VS 2010(不完全是c ++ 11,只是实现了一些功能:auto,lambdas ......)和boost 1.53(这里唯一的问题是string_view仍然不可用,但是使用string_ref仍然有效,甚至可能会在将来弃用它。)

  2. 日志被压缩,并使用一个打开的库直接解压缩到ram内存,该库输出一个旧的原始C" char"数组,所以不值得使用std::string,因为内存已经由库分配。它们有数千个并且它们填充了几GB,因此将它们保存在内存中是不可取的。我的意思是,由于在解析日志后删除了日志,因此无法使用string_view

  3. 将日期字符串解析为POSIX时间可能是个好主意。只有两个评论:避免为此分配一个字符串应该是有趣的,因为我知道POSIX时间不允许ms,所以它们应该保存在另一个额外的变量中。

  4. 通过日志(道路变量p.e.)重复一些字符串。使用一些flyweight模式(它的boost实现)来减少内存可能会很有趣,即使记住这会有性能成本。

  5. 使用模板库时编译时间很麻烦。我非常感谢任何有助于减少它们的调整:可能将语法分成子语法?也许使用预编译的标题?

  6. 最后一次使用此方法是对任何事件进行查询,例如:获取所有GEAR事件(值和时间),并在固定间隔内或每次事件发生时记录所有汽车变量。日志中有两种类型的记录:pure" Location"记录和"位置+事件"记录(我的意思是,每次解析一个事件时,也必须保存该位置)。将它们分成两个向量可以实现快速查询,但会降低解析速度。仅使用公共向量允许快速解析但减慢查询速度。对此有何想法?也许提升多索引容器会有所帮助吗?

  7. 请不要犹豫,提供任何解决方案或更改您认为可能有助于实现目标的任何内容。

    //#define BOOST_SPIRIT_DEBUG
    #include <boost/fusion/adapted/struct.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <cstring> // strlen
    
    typedef char const* It;
    
    namespace MyEvents {
        enum Kind { LOCATION, SLOPE, GEAR, DIR };
    
        struct Event {
            Kind kind;
            double value;
        };
    
        struct LogRecord {
            int driver;        
            double time;
            double vel;
            double km;
            std::string date;
            std::string road;
            Event event;
        };
    
        typedef std::vector<LogRecord> LogRecords;
    }
    
    BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))
    
    
    BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
            (std::string, date)
            (double, time)
            (int, driver)
            (double, vel)
            (std::string, road)
            (double, km)
            (MyEvents::Event, event))
    
    namespace qi = boost::spirit::qi;
    
    namespace QiParsers {
        template <typename It>
        struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
    
            LogParser() : LogParser::base_type(start) {
                using namespace qi;
    
                kind.add
                    ("SLOPE", MyEvents::SLOPE)
                    ("GEAR", MyEvents::GEAR)
                    ("DIR", MyEvents::DIR);
    
                values.add("G1", 1.0)
                          ("G2", 2.0)
                          ("REVERSE", -1.0)
                          ("NORTH", 1.0)
                          ("EAST", 2.0)
                          ("WEST", 3.0)
                          ("SOUTH", 4.0);
    
                MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};
    
                line_record
                    = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >> 
                                 repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                    >> " - " >> double_ >> " s"
                    >> " => Driver: "  >> int_
                    >> " - Speed: "    >> double_
                    >> " - Road: "     >> raw[+graph]
                    >> " - Km: "       >> double_
                    >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));
    
                start = line_record % eol;
    
                //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
            }
    
          private:
            qi::rule<It, MyEvents::LogRecords()> start;
    
            qi::rule<It, MyEvents::LogRecord()> line_record;
    
            qi::symbols<char, MyEvents::Kind> kind;
            qi::symbols<char, double> values;
        };
    }
    
    MyEvents::LogRecords parse_spirit(It b, It e) {
        static QiParsers::LogParser<It> const parser;
    
        MyEvents::LogRecords records;
        parse(b, e, parser, records);
    
        return records;
    }
    
    static char input[] = 
    "[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
    [2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
    [2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
    [2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
    [2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
    [2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
    [2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
    [2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
    [2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
    static const size_t len = strlen(input);
    
    namespace MyEvents { // for debug/demo
        using boost::fusion::operator<<;
    
        static inline std::ostream& operator<<(std::ostream& os, Kind k) {
            switch(k) {
                case LOCATION: return os << "LOCATION";
                case SLOPE:    return os << "SLOPE";
                case GEAR:     return os << "GEAR";
                case DIR:      return os << "DIR";
            }
            return os;
        }
    }
    
    int main() {
        MyEvents::LogRecords records = parse_spirit(input, input+len);
        std::cout << "Parsed: " << records.size() << " records\n";
    
        for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
            std::cout << *it << "\n"; 
    
        return 0;
    }
    

1 个答案:

答案 0 :(得分:3)

string_ref基本相同,但在某些时候使用与std::string_view略有不同的界面

修订版1:POSIX时间

存储POSIX时间非常简单:

#include <boost/date_time/posix_time/posix_time_io.hpp>

接下来,替换类型:

typedef boost::posix_time::ptime Timestamp;

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;    // << HERE using Timestamp now
    std::string road;
    Event event;
};

将解析器简化为:

'[' >> stream >> ']'

打印 Live On Coliru

Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订版#2:压缩文件

您也可以使用IOStreams透明地解压缩输入:

int main(int argc, char **argv) {
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 
}

parse_logfile然后可以实现为:

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
    static QiParsers::LogParser<It> const parser;

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into);
}
  

注意:该库具有zlib,gzip和bzip2解压缩程序。我选择了gzip进行演示

打印 Live On Coliru

Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订版3:字符串实习

“Interned”字符串或“Atoms”是减少字符串分配的常用方法。你可以使用Boost Flyweight,但根据我的经验,要做到这一点有点复杂。那么,为什么不创建自己的抽象:

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

现在,我们需要将其集成到解析器中。我建议使用语义动作

  

注意:特征在这里仍然有用,但它们是静态的,这需要StringTable是全局的,这是我永远不会做出的选择......除非绝对有义务

首先,改变Ast:

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;
    Atom road;       // << HERE using Atom now
    Event event;
};

接下来,让我们创建一个创建这样一个原子的规则:

qi::rule<It, MyEvents::Atom()> atom;

atom = raw[+graph][_val = intern_(_1)];

当然,这引出了如何实现语义行为的问题:

struct intern_f {
    StringTable& _table;

    typedef StringTable::Atom result_type;
    explicit intern_f(StringTable& table) : _table(table) {}

    StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
        return _table.intern(sequential(range));
    }

  private:
    // be more efficient if It is const char*
    static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
        return boost::string_ref(range.begin(), range.size());
    }
    template <typename OtherIt>
    static std::string sequential(boost::iterator_range<OtherIt> const& range) {
        return std::string(range.begin(), range.end());
    }
};
boost::phoenix::function<intern_f> intern_;

语法的构造函数将intern_仿函数挂钩到传入的StringTable&

完整演示

<强> Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <fstream>
#include <cstring> // strlen

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    typedef boost::posix_time::ptime Timestamp;
    typedef StringTable::Atom Atom;

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        Timestamp date;
        Atom road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (MyEvents::Timestamp, date)
        (double, time)
        (int, driver)
        (double, vel)
        (MyEvents::Atom, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            atom = raw[+graph][_val = intern_(_1)];

            line_record
                = '[' >> stream >> ']'
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> atom
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
        }

      private:
        struct intern_f {
            StringTable& _table;

            typedef StringTable::Atom result_type;
            explicit intern_f(StringTable& table) : _table(table) {}

            StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
                return _table.intern(sequential(range));
            }

          private:
            // be more efficient if It is const char*
            static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
                return boost::string_ref(range.begin(), range.size());
            }
            template <typename OtherIt>
            static std::string sequential(boost::iterator_range<OtherIt> const& range) {
                return std::string(range.begin(), range.end());
            }
        };
        boost::phoenix::function<intern_f> intern_;

        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::rule<It, MyEvents::Atom()> atom;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
    QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into, strings);
}

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main(int argc, char **argv) {
    StringTable strings;
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records, strings);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    std::cout << "Interned strings: " << strings.index.size() << "\n";
    std::cout << "Table backing: '";
    std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
    std::cout << "'\n";
    for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
        std::cout << " entry - " << *it << "\n";
    }
}

当使用2个输入文件运行时,第二个输入文件稍有变化:

zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz

打印

Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))

有趣的是在实习字符串统计数据中:

Interned strings: 4
Table backing: 'A-11B-16'
 entry - A-1
 entry - A-11
 entry - B-1
 entry - B-16

请注意B-1A-1如何将A-11B-16作为子字符串进行重复数据删除,因为它们已经被实习。预先设置stringtable可能有利于实现最佳重用。

各种备注

我没有很多减少编译时间的技巧。我只是将所有Spirit的东西放在一个单独的TU中并接受编译时间。毕竟,这是关于交易运行时性能的编译时间。

关于字符串实习,您最好使用flat_set<char const*>服务,这样您只需按需构建一个具有特定长度的原子。

如果所有字符串都很小,那么只使用小字符串优化就可以(远)更好。

我会让你做比较基准测试,你可能想继续使用你自己的解压缩+ const char *迭代器。这主要是为了表明Boost有它,你不需要“一次读取整个文件”。

事实上,在这个主题上,您可能希望将结果存储在内存映射文件中,因此即使您超出了物理内存限制,您也会很乐意继续工作。

多指数&amp;查询

您可以在我之前的回答中找到有关此问题的具体示例:BONUS: Multi-Index

特别注意通过引用获取索引的方法:

Indexing::Table idx(events.begin(), events.end());

这也可以用于将结果集存储在另一个(索引)容器中,以便重复/进一步处理。