My goal is to import 25M edges into a graph that has about 50M vertices. Target time:
The current import speed is ~150 edges/sec. Over a remote connection the speed is ~100 edges/sec.
+ extracted 20,694,336 rows (171 rows/sec) - 20,694,336 rows -> loaded 20,691,830 vertices (171 vertices/sec) Total time: 35989762ms [0 warnings, 4 errors]
+ extracted 20,694,558 rows (156 rows/sec) - 20,694,558 rows -> loaded 20,692,053 vertices (156 vertices/sec) Total time: 35991185ms [0 warnings, 4 errors]
+ extracted 20,694,745 rows (147 rows/sec) - 20,694,746 rows -> loaded 20,692,240 vertices (147 vertices/sec) Total time: 35992453ms [0 warnings, 4 errors]
+ extracted 20,694,973 rows (163 rows/sec) - 20,694,973 rows -> loaded 20,692,467 vertices (162 vertices/sec) Total time: 35993851ms [0 warnings, 4 errors]
+ extracted 20,695,179 rows (145 rows/sec) - 20,695,179 rows -> loaded 20,692,673 vertices (145 vertices/sec) Total time: 35995262ms [0 warnings, 4 errors]
I tried enabling parallel in the ETL config, but it looks completely broken in OrientDB 2.2.12 (inconsistent multi-threading changes from 2.1?) and only produced the 4 errors seen in the logs above. Dumb parallel mode (running 2+ ETL processes) is also not possible over a local connection, since the plocal engine locks the database to a single process; a sketch of the remote alternative follows.
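For reference, a minimal sketch of the loader fragment that would let several ETL processes share one database, since only the remote protocol allows concurrent access from multiple processes. My assumptions here: a server listening on the default port of the same machine, and the database name df taken from the plocal path in the config below.

  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/df",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbType": "graph"
    }
  }

As noted above, though, remote connections are slower here (~100 edges/sec).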
My config:
{
  "config": {
    "log": "info",
    "parallel": true
  },
  "source": {
    "input": {}
  },
  "extractor": {
    "row": {
      "multiLine": false
    }
  },
  "transformers": [
    {
      "code": {
        "language": "Javascript",
        "code": "(new com.orientechnologies.orient.core.record.impl.ODocument()).fromJSON(input);"
      }
    },
    {
      "merge": {
        "joinFieldName": "_ref",
        "lookup": "Company._ref"
      }
    },
    {
      "vertex": {
        "class": "Company",
        "skipDuplicates": true
      }
    },
    {
      "edge": {
        "joinFieldName": "with_id",
        "lookup": "Person._ref",
        "direction": "in",
        "class": "Stakeholder",
        "edgeFields": {
          "_ref": "${input._ref}",
          "value_of_share": "${input.value_of_share}"
        },
        "skipDuplicates": true,
        "unresolvedLinkAction": "ERROR"
      }
    },
    {
      "field": {
        "fieldNames": [
          "with_id",
          "with_to",
          "_type",
          "value_of_share"
        ],
        "operation": "remove"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoDropIfExists": false,
      "dbAutoCreate": false,
      "standardElementConstraints": false,
      "tx": false,
      "wal": false,
      "batchCommit": 1000,
      "dbType": "graph",
      "classes": [
        {
          "name": "Company",
          "extends": "V"
        },
        {
          "name": "Person",
          "extends": "V"
        },
        {
          "name": "Stakeholder",
          "extends": "E"
        }
      ]
    }
  }
}
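Both the merge and the edge transformers above resolve records via lookups (Company._ref and Person._ref). If those fields are not backed by an index, every lookup degenerates into a scan whose cost grows with the number of records already loaded, which would be consistent with the slowdown in the UPDATE below. For illustration, a minimal sketch of how the lookup indexes could be declared next to "classes" in the loader section (assuming string keys and the loader's standard "indexes" option):

  "indexes": [
    { "class": "Company", "fields": ["_ref:string"], "type": "UNIQUE_HASH_INDEX" },
    { "class": "Person", "fields": ["_ref:string"], "type": "UNIQUE_HASH_INDEX" }
  ]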
Data sample:
{" _ref":" 1072308006473"" with_to":"人"" with_id":& #34; 010703814320"," _type":" is.stakeholder"," value_of_share":10000.0} {" _ref":& #34; 1075837000095"" with_to":"人"" with_id":" 583600656732"" _type&# 34;:" is.stakeholder"," value_of_share":15925.0} {" _ref":" 1075837000095"," with_to&# 34;:"人"" with_id":" 583600851010"" _type":" is.stakeholder&#34 ;, " value_of_share":33150.0}
The server's specs are: an instance on Google Cloud, PD-SSD, 6 CPUs, 18GB RAM.
By the way, on the same server I managed to reach ~3k/sec when importing vertices over a remote connection (still too slow, but acceptable for my current dataset).
The question is: is there any reliable way to speed up the import to 10k inserts per second, or at least 5k? I don't want to turn off indexes; it is still millions of records, not billions.
UPDATE
After a few more hours, performance continues to deteriorate:
+ extracted 23,146,912 rows (56 rows/sec) - 23,146,912 rows -> loaded 23,144,406 vertices (56 vertices/sec) Total time: 60886967ms [0 warnings, 4 errors]
+ extracted 23,146,981 rows (69 rows/sec) - 23,146,981 rows -> loaded 23,144,475 vertices (69 vertices/sec) Total time: 60887967ms [0 warnings, 4 errors]
+ extracted 23,147,075 rows (39 rows/sec) - 23,147,075 rows -> loaded 23,144,570 vertices (39 vertices/sec) Total time: 60890356ms [0 warnings, 4 errors]