I have a problem storing network monitoring log data in OrientDB. The data is collected from 2 different sources and written to CSV files every minute (~200k records/day in total). Let's call the two sources alert and waf; they share common properties such as sourceIP and destinationIP.
In terms of the OrientDB data model, I defined 2 vertex classes, Alert and WAF, and 1 edge class, connect. This edge stores the relation between an Alert and a WAF when they have the same sourceIP and destinationIP.
While parsing and inserting the records into OrientDB, alerts are stored normally. However, every time I insert a waf, I need to retrieve the related alert from Alert so that I can create the relation and store it in connect. In other words, if 100k waf records need to be inserted, there will be 100k lookups for a related alert in order to create the edges.
Unsurprisingly, with this implementation the insert performance is quite slow. I tried adding 1.1M records, of which 19k were alerts, and it took ~43 mins.
I am wondering whether my current approach is wrong, or whether there is a better solution for this.
This is an example of my implementation in Java.
void importCSV(OrientGraph graph, List<Alert> alerts, List<WAF> wafs)
{
    // Insert all alerts first.
    for (Alert a : alerts) {
        graph.createVertex ......
    }
    // Insert each waf and link it to its related alert.
    for (WAF w : wafs) {
        Vertex v = graph.createVertex ......
        // Check for a related alert, which may be in the alerts list or already in the DB.
        Vertex alert = findRelatedAlert(graph, <conditions>);
        Edge relation = graph.createEdge(alert, v);
    }
    graph.commit();
}
Thanks in advance.
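Answer 0 (score: 1)
The real difference lies in the right balance between the heap and the virtual memory used by memory mapping, especially on large datasets (GBs, TBs, and so on), where in-memory cache structures count for less than raw IO. The advice, of course, is to look at documentation_memory_performance, which gives information and examples that help to optimize the configuration. To speed up queries, you can add indexes, whose purpose is to make records easier to retrieve (documentation_index).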
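To illustrate why an index on (sourceIP, destinationIP) helps with the lookup described in the question, here is a minimal sketch in plain Java. It is not OrientDB's API: the class AlertIndexSketch, the Alert stand-in, and the methods key, buildIndex and findRelated are all hypothetical names made up for this example. It only shows the general idea of replacing a scan per waf record with a keyed lookup, and it assumes the alerts being matched fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a composite-key lookup table, not OrientDB code.
class AlertIndexSketch {

    // Hypothetical stand-in for an Alert record with the two shared properties.
    static final class Alert {
        final String id;
        final String sourceIP;
        final String destinationIP;

        Alert(String id, String sourceIP, String destinationIP) {
            this.id = id;
            this.sourceIP = sourceIP;
            this.destinationIP = destinationIP;
        }
    }

    // Composite key over both properties; '|' cannot occur in an IP address.
    static String key(String sourceIP, String destinationIP) {
        return sourceIP + "|" + destinationIP;
    }

    // One pass over the alerts builds the lookup table.
    static Map<String, List<Alert>> buildIndex(List<Alert> alerts) {
        Map<String, List<Alert>> index = new HashMap<>();
        for (Alert a : alerts) {
            index.computeIfAbsent(key(a.sourceIP, a.destinationIP),
                                  k -> new ArrayList<>()).add(a);
        }
        return index;
    }

    // Each waf record then costs one hash lookup instead of a scan.
    static List<Alert> findRelated(Map<String, List<Alert>> index,
                                   String sourceIP, String destinationIP) {
        List<Alert> hit = index.get(key(sourceIP, destinationIP));
        return hit != null ? hit : Collections.<Alert>emptyList();
    }

    public static void main(String[] args) {
        List<Alert> alerts = Arrays.asList(
            new Alert("a1", "10.0.0.1", "10.0.0.2"),
            new Alert("a2", "10.0.0.3", "10.0.0.4"),
            new Alert("a3", "10.0.0.1", "10.0.0.2"));
        Map<String, List<Alert>> index = buildIndex(alerts);
        // Number of alerts sharing the IP pair of a hypothetical waf record.
        System.out.println(findRelated(index, "10.0.0.1", "10.0.0.2").size());
    }
}
```

A database index on the same two properties plays the same role on the server side, so the lookup for each waf record does not have to scan the whole Alert class. Because the question notes that related alerts may already be in the DB and not only in the current batch, an in-memory table like this would only cover the alerts loaded into it.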