Question

关于解析复杂日志的几个问题后，我终于被告知这样做的最佳方法。

现在的问题是，是否有某种方法可以提高性能和/或减少内存使用量，甚至是编译时间。我会要求答案满足这些限制：

MS VS 2010（不完全是c ++ 11，只是实现了一些功能：auto，lambdas ......）和boost 1.53（这里唯一的问题是string_view仍然不可用，但是使用string_ref仍然有效，甚至可能会在将来弃用它。）
日志被压缩，并使用一个打开的库直接解压缩到ram内存，该库输出一个旧的原始C＆＃34; char＆＃34;数组，所以不值得使用std::string，因为内存已经由库分配。它们有数千个并且它们填充了几GB，因此将它们保存在内存中是不可取的。我的意思是，由于在解析日志后删除了日志，因此无法使用string_view。
将日期字符串解析为POSIX时间可能是个好主意。只有两个评论：避免为此分配一个字符串应该是有趣的，因为我知道POSIX时间不允许ms，所以它们应该保存在另一个额外的变量中。
通过日志（道路变量p.e.）重复一些字符串。使用一些flyweight模式（它的boost实现）来减少内存可能会很有趣，即使记住这会有性能成本。
使用模板库时编译时间很麻烦。我非常感谢任何有助于减少它们的调整：可能将语法分成子语法？也许使用预编译的标题？
最后一次使用此方法是对任何事件进行查询，例如：获取所有GEAR事件（值和时间），并在固定间隔内或每次事件发生时记录所有汽车变量。日志中有两种类型的记录：pure＆＃34; Location＆＃34;记录和＆＃34;位置+事件＆＃34;记录（我的意思是，每次解析一个事件时，也必须保存该位置）。将它们分成两个向量可以实现快速查询，但会降低解析速度。仅使用公共向量允许快速解析但减慢查询速度。对此有何想法？也许提升多索引容器会有所帮助吗？

请不要犹豫，提供任何解决方案或更改您认为可能有助于实现目标的任何内容。

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstring> // strlen

typedef char const* It;

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    struct LogRecord {
        int driver;        
        double time;
        double vel;
        double km;
        std::string date;
        std::string road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
    (MyEvents::Kind, kind)
    (double, value))


BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (std::string, date)
        (double, time)
        (int, driver)
        (double, vel)
        (std::string, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser() : LogParser::base_type(start) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            line_record
                = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >> 
                             repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> raw[+graph]
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
        }

      private:
        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

MyEvents::LogRecords parse_spirit(It b, It e) {
    static QiParsers::LogParser<It> const parser;

    MyEvents::LogRecords records;
    parse(b, e, parser, records);

    return records;
}

static char input[] = 
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main() {
    MyEvents::LogRecords records = parse_spirit(input, input+len);
    std::cout << "Parsed: " << records.size() << " records\n";

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    return 0;
}

Answer 1

是string_ref基本相同，但在某些时候使用与std::string_view略有不同的界面

修订版1：POSIX时间

存储POSIX时间非常简单：

#include <boost/date_time/posix_time/posix_time_io.hpp>

接下来，替换类型：

typedef boost::posix_time::ptime Timestamp;

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;    // << HERE using Timestamp now
    std::string road;
    Event event;
};

将解析器简化为：

'[' >> stream >> ']'

打印 Live On Coliru

Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订版＃2：压缩文件

您也可以使用IOStreams透明地解压缩输入：

int main(int argc, char **argv) {
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 
}

parse_logfile然后可以实现为：

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
    static QiParsers::LogParser<It> const parser;

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into);
}

注意：该库具有zlib，gzip和bzip2解压缩程序。我选择了gzip进行演示

打印 Live On Coliru

Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

修订版3：字符串实习

“Interned”字符串或“Atoms”是减少字符串分配的常用方法。你可以使用Boost Flyweight，但根据我的经验，要做到这一点有点复杂。那么，为什么不创建自己的抽象：

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

现在，我们需要将其集成到解析器中。我建议使用语义动作

注意：特征在这里仍然有用，但它们是静态的，这需要StringTable是全局的，这是我永远不会做出的选择......除非绝对有义务

首先，改变Ast：

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;
    Atom road;       // << HERE using Atom now
    Event event;
};

接下来，让我们创建一个创建这样一个原子的规则：

qi::rule<It, MyEvents::Atom()> atom;

atom = raw[+graph][_val = intern_(_1)];

当然，这引出了如何实现语义行为的问题：

struct intern_f {
    StringTable& _table;

    typedef StringTable::Atom result_type;
    explicit intern_f(StringTable& table) : _table(table) {}

    StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
        return _table.intern(sequential(range));
    }

  private:
    // be more efficient if It is const char*
    static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
        return boost::string_ref(range.begin(), range.size());
    }
    template <typename OtherIt>
    static std::string sequential(boost::iterator_range<OtherIt> const& range) {
        return std::string(range.begin(), range.end());
    }
};
boost::phoenix::function<intern_f> intern_;

语法的构造函数将intern_仿函数挂钩到传入的StringTable&。

完整演示

<强> Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <fstream>
#include <cstring> // strlen

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    typedef boost::posix_time::ptime Timestamp;
    typedef StringTable::Atom Atom;

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        Timestamp date;
        Atom road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (MyEvents::Timestamp, date)
        (double, time)
        (int, driver)
        (double, vel)
        (MyEvents::Atom, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            atom = raw[+graph][_val = intern_(_1)];

            line_record
                = '[' >> stream >> ']'
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> atom
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
        }

      private:
        struct intern_f {
            StringTable& _table;

            typedef StringTable::Atom result_type;
            explicit intern_f(StringTable& table) : _table(table) {}

            StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
                return _table.intern(sequential(range));
            }

          private:
            // be more efficient if It is const char*
            static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
                return boost::string_ref(range.begin(), range.size());
            }
            template <typename OtherIt>
            static std::string sequential(boost::iterator_range<OtherIt> const& range) {
                return std::string(range.begin(), range.end());
            }
        };
        boost::phoenix::function<intern_f> intern_;

        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::rule<It, MyEvents::Atom()> atom;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
    QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into, strings);
}

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main(int argc, char **argv) {
    StringTable strings;
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records, strings);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    std::cout << "Interned strings: " << strings.index.size() << "\n";
    std::cout << "Table backing: '";
    std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
    std::cout << "'\n";
    for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
        std::cout << " entry - " << *it << "\n";
    }
}

当使用2个输入文件运行时，第二个输入文件稍有变化：

zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz

打印

Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))

有趣的是在实习字符串统计数据中：

Interned strings: 4
Table backing: 'A-11B-16'
 entry - A-1
 entry - A-11
 entry - B-1
 entry - B-16

请注意B-1和A-1如何将A-11和B-16作为子字符串进行重复数据删除，因为它们已经被实习。预先设置stringtable可能有利于实现最佳重用。

各种备注

我没有很多减少编译时间的技巧。我只是将所有Spirit的东西放在一个单独的TU中并接受编译时间。毕竟，这是关于交易运行时性能的编译时间。

关于字符串实习，您最好使用flat_set<char const*>服务，这样您只需按需构建一个具有特定长度的原子。

如果所有字符串都很小，那么只使用小字符串优化就可以（远）更好。

我会让你做比较基准测试，你可能想继续使用你自己的解压缩+ const char *迭代器。这主要是为了表明Boost有它，你不需要“一次读取整个文件”。

事实上，在这个主题上，您可能希望将结果存储在内存映射文件中，因此即使您超出了物理内存限制，您也会很乐意继续工作。

多指数＆amp;查询

您可以在我之前的回答中找到有关此问题的具体示例：BONUS: Multi-Index

特别注意通过引用获取索引的方法：

Indexing::Table idx(events.begin(), events.end());

这也可以用于将结果集存储在另一个（索引）容器中，以便重复/进一步处理。

Boost :: Spirit解析器：寻找最大性能和最小内存使用量

1 个答案: