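After several questions about parsing complex logs, I have finally been told the best way to do it.
The question now is whether there is some way to improve performance and/or reduce memory usage, and even compilation time. I would ask that answers meet these constraints:
MS VS 2010 (not fully C++11, only some features implemented: auto, lambdas...) and Boost 1.53 (the only problem here is that string_view is not yet available, but string_ref is still valid to use, even though it will probably be deprecated in the future).
The logs are zipped, and they are unzipped directly into RAM using an open library that outputs an old raw C "char" array, so it is not worth using std::string because the memory is already allocated by the library. There are thousands of them and they fill several GB, so keeping them in memory is not an option. I mean, using string_view is not possible because the logs are deleted after parsing them.
It could be a good idea to parse the date string into a POSIX time. Just two remarks: it would be interesting to avoid allocating a string for this, and as far as I know POSIX time does not allow ms, so they should be kept in an extra variable.
Some strings are repeated throughout the logs (the road variable, for example). It could be interesting to use a flyweight pattern (its Boost implementation) to reduce memory, keeping in mind that this will have a performance cost.
Compilation time is a pain when using template libraries. I would really appreciate any tweak that helps reduce it: maybe splitting the grammar into sub-grammars? Maybe using precompiled headers?
The final use of this is to run queries on any event, for example: get all GEAR events (value and time), and also log all car variables at fixed intervals or each time an event occurs. There are two types of records in the log: pure "Location" records and "Location + Event" records (I mean, each time an event is parsed, the location has to be saved too). Splitting them into two vectors allows fast queries but slows down parsing. Using a single common vector allows fast parsing but slows down queries. Any ideas on this? Maybe a Boost multi-index container would help?
Please do not hesitate to provide any solution or to change anything you think could help achieve the goal.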
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cstring> // strlen
#include <iostream>
#include <string>
#include <vector>

typedef char const* It;

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        std::string date;
        std::string road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

// The adapted order follows the order of the fields in a log line
BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (std::string, date)
        (double, time)
        (int, driver)
        (double, vel)
        (std::string, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {
        LogParser() : LogParser::base_type(start) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                ("G2", 2.0)
                ("REVERSE", -1.0)
                ("NORTH", 1.0)
                ("EAST", 2.0)
                ("WEST", 3.0)
                ("SOUTH", 4.0);

            // Records without an event get this placeholder
            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            line_record
                = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >>
                        repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                >> " - " >> double_ >> " s"
                >> " => Driver: " >> int_
                >> " - Speed: " >> double_
                >> " - Road: " >> raw[+graph]
                >> " - Km: " >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
        }

      private:
        qi::rule<It, MyEvents::LogRecords()> start;
        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

MyEvents::LogRecords parse_spirit(It b, It e) {
    static QiParsers::LogParser<It> const parser;

    MyEvents::LogRecords records;
    parse(b, e, parser, records);

    return records;
}

static char input[] =
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main() {
    MyEvents::LogRecords records = parse_spirit(input, input+len);
    std::cout << "Parsed: " << records.size() << " records\n";

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n";

    return 0;
}
Answer 0 (score: 3)
Yes, string_ref is basically the same, but at some points it uses a slightly different interface from std::string_view.
Storing POSIX time is pretty simple:
#include <boost/date_time/posix_time/posix_time_io.hpp>
Next, replace the type:
typedef boost::posix_time::ptime Timestamp;

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date; // << HERE using Timestamp now
    std::string road;
    Event event;
};
Simplify the date part of the parser to just:
'[' >> stream >> ']'
Prints:
Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
You could also use Boost.Iostreams to unzip the input transparently:
int main(int argc, char **argv) {
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records);

        std::cout
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n";
}
parse_logfile can then be implemented as:
template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
    static QiParsers::LogParser<It> const parser;

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
    // Decompress on the fly while reading
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    // Multi-pass adaptor over the stream; noskipws keeps the whitespace
    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into);
}
Note: the library has zlib, gzip and bzip2 decompressors. I chose gzip for the demonstration.
Prints:
Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
"Interned" strings, or "Atoms", are a common way to reduce string allocations. You could use Boost Flyweight, but in my experience it is a bit complicated to get right. So why not create your own abstraction:
struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            // Caveat: std::deque storage is not contiguous, so this string_ref
            // is only valid if the matched characters do not straddle one of
            // the deque's internal blocks.
            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};
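Now we need to integrate that into the parser. I would suggest using a semantic action.
Note: traits would still be useful here, but they are static, which would require the StringTable to be global. That is a choice I would never make... unless absolutely obligated.
First, change the AST:
struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;
    Atom road; // << HERE using Atom now
    Event event;
};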
Next, let's create a rule that produces such an atom:
qi::rule<It, MyEvents::Atom()> atom;
atom = raw[+graph][_val = intern_(_1)];
Of course, that raises the question of how to implement the semantic action:
struct intern_f {
    StringTable& _table;

    typedef StringTable::Atom result_type;
    explicit intern_f(StringTable& table) : _table(table) {}

    StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
        return _table.intern(sequential(range));
    }
  private:
    // be more efficient if It is const char*
    static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
        return boost::string_ref(range.begin(), range.size());
    }
    // any other iterator type: copy into a temporary string first
    template <typename OtherIt>
    static std::string sequential(boost::iterator_range<OtherIt> const& range) {
        return std::string(range.begin(), range.end());
    }
};
boost::phoenix::function<intern_f> intern_;
The constructor of the grammar hooks the intern_ functor up to the StringTable& that is passed in.
Live On Coliru
//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <algorithm> // search, copy
#include <cstring> // strlen
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator> // ostreambuf_iterator
#include <string>
#include <vector>

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     */
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            // Caveat: std::deque storage is not contiguous, so this string_ref
            // is only valid if the matched characters do not straddle one of
            // the deque's internal blocks.
            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    typedef boost::posix_time::ptime Timestamp;
    typedef StringTable::Atom Atom;

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        Timestamp date;
        Atom road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (MyEvents::Timestamp, date)
        (double, time)
        (int, driver)
        (double, vel)
        (MyEvents::Atom, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                ("G2", 2.0)
                ("REVERSE", -1.0)
                ("NORTH", 1.0)
                ("EAST", 2.0)
                ("WEST", 3.0)
                ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            atom = raw[+graph][_val = intern_(_1)];

            line_record
                = '[' >> stream >> ']'
                >> " - " >> double_ >> " s"
                >> " => Driver: " >> int_
                >> " - Speed: " >> double_
                >> " - Road: " >> atom
                >> " - Km: " >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
        }

      private:
        struct intern_f {
            StringTable& _table;

            typedef StringTable::Atom result_type;
            explicit intern_f(StringTable& table) : _table(table) {}

            StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
                return _table.intern(sequential(range));
            }
          private:
            // be more efficient if It is const char*
            static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
                return boost::string_ref(range.begin(), range.size());
            }
            template <typename OtherIt>
            static std::string sequential(boost::iterator_range<OtherIt> const& range) {
                return std::string(range.begin(), range.end());
            }
        };
        boost::phoenix::function<intern_f> intern_;

        qi::rule<It, MyEvents::LogRecords()> start;
        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::rule<It, MyEvents::Atom()> atom;
        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
    QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into, strings);
}

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main(int argc, char **argv) {
    StringTable strings;
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records, strings);

        std::cout
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n";

    std::cout << "Interned strings: " << strings.index.size() << "\n";
    std::cout << "Table backing: '";
    std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
    std::cout << "'\n";
    for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
        std::cout << " entry - " << *it << "\n";
    }
}
When run with 2 input files, the second one being a slightly changed copy of the first:
zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz
Prints:
Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))
The interesting part is in the interned string statistics:
Interned strings: 4
Table backing: 'A-11B-16'
entry - A-1
entry - A-11
entry - B-1
entry - B-16
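Note how B-1 and A-1 were deduplicated as substrings of A-11 and B-16, because those had already been interned. Pre-seeding your string table might help to achieve the best reuse.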
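I don't have many tips for reducing compile times. I would simply put all the Spirit code in a separate TU and accept the compile time for that one. After all, this is about trading compile time for runtime performance.
Regarding string interning, you might be better served by a flat_set<char const*>, so that you only construct an atom with a specific length on demand.
If all the strings are small, you might be (far) better off just relying on a small string optimization.
I'll let you do the comparative benchmarks; you might want to keep using your own unzipping plus const char* iterators. This was mostly to show that Boost has it, and that you do not need to "read the whole file at once".
In fact, on that subject, you might want to store the results in a memory-mapped file, so that you can happily keep working even if you exceed the physical memory limits.
You can find concrete examples of this in my previous answer, under "BONUS: Multi-Index".
Note in particular the approach of getting the index by reference:
Indexing::Table idx(events.begin(), events.end());
This can also be used to store a result set in another (indexed) container for repeated or further processing.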