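I am building a prototype data mining tool that collects data from several sources:
1) MySQL db - 2,000,000 vertices, 20,000,000 edges
2) Custom data file - 2,000,000 vertices, 700,000,000 edges
3) A different custom data file - 300,000 vertices, 500,000,000 edges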
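From a performance standpoint, is it better to use the ETL with an embedded database, or a custom Java loader?
The data in the custom data files can easily be converted to CSV or JSON.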
Answer 0 (score: 0)
I'm the ETL maintainer. Beyond the input data format, I would look at which kinds of transformation your data sets need and how many times you need to move the data.
The ETL can be configured to perform some transformations, and you can use it with a plocal db to achieve maximum performance. If you need to re-import frequently, need very complex transformations, or your data format can vary over time, you can write a custom Java program instead.
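Whichever route you take, the conversion of the custom files should stream the input rather than load it into memory, given the edge counts in the question. Below is a minimal sketch of that step only, not a full loader. The class name `EdgeListToCsv`, the `from,to` header, and the input format (one edge per line as two whitespace-separated IDs, with `#` marking a comment line) are all assumptions made for illustration, since the question does not describe the custom format.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EdgeListToCsv {

    /**
     * Streams a hypothetical edge-list format ("fromId toId" per line,
     * '#' starts a comment line) into CSV with a "from,to" header.
     * Only one line is held in memory at a time.
     * Returns the number of edges written.
     */
    public static long convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        long edges = 0;
        out.write("from,to\n");
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IOException("Malformed edge line: " + line);
            }
            // Assumes IDs contain no commas or quotes, so no CSV escaping is done.
            out.write(parts[0]);
            out.write(',');
            out.write(parts[1]);
            out.write('\n');
            edges++;
        }
        out.flush();
        return edges;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: EdgeListToCsv <input-file> <output-csv>");
            return;
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            long n = convert(in, out);
            System.out.println("Converted " + n + " edges");
        }
    }
}
```

The same read loop could feed a custom loader directly instead of writing CSV; which is faster for your data is something you would have to measure.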