I have a small script written in Scala that I use to load a MongoDB instance with 100,000,000 sample records. The idea is to get the database fully loaded and then do some performance testing (tuning and reloading as necessary).

The problem is that the load time per 100,000 records is increasing pretty much linearly. At the start of my load process it took only 4 seconds to load that many records. Now, at nearly 6,000,000 records, loading the same amount (100,000) takes 300 to 400 seconds! That's two orders of magnitude slower! Queries are still snappy, but at this rate I'll never manage to load the amount of data I want.

Would it be faster to write all of my records (all 100,000,000 of them!) to a file and then import the whole thing with mongoimport? Or are my expectations simply too high, and am I pushing the database beyond what it's meant to handle?

Any thoughts? Thanks!

Here's my script:
import java.util.Date

import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.MongoDBObject

object MongoPopulateTest {
  val ONE_HUNDRED_THOUSAND = 100000
  val ONE_MILLION = ONE_HUNDRED_THOUSAND * 10

  val random = new scala.util.Random(12345)
  val connection = MongoConnection()
  val db = connection("mongoVolumeTest")
  val collection = db("testData")

  val INDEX_KEYS = List("A", "G", "E", "F")

  def main(args: Array[String]) {
    populateCoacs(ONE_MILLION * 100)
  }

  // Create the indexes up front, then insert `count` random records one at
  // a time, printing the elapsed time for every batch of 100,000.
  def populateCoacs(count: Int) {
    println("Creating indexes: " + INDEX_KEYS.mkString(", "))
    INDEX_KEYS.map(key => collection.ensureIndex(MongoDBObject(key -> 1)))

    println("Adding " + count + " records to DB.")
    val start = (new Date()).getTime()
    var lastBatch = start

    for (i <- 0 until count) {
      collection.save(makeCoac())
      if (i % 100000 == 0 && i != 0) {
        println(i + ": " + (((new Date()).getTime() - lastBatch) / 1000.0) + " seconds (" + (new Date()).toString() + ")")
        lastBatch = (new Date()).getTime()
      }
    }

    val elapsedSeconds = ((new Date).getTime() - start) / 1000
    println("Done. " + count + " COAC rows inserted in " + elapsedSeconds + " seconds.")
  }

  // Build one random sample record.
  def makeCoac(): MongoDBObject = {
    MongoDBObject(
      "A" -> random.nextPrintableChar().toString(),
      "B" -> scala.math.abs(random.nextInt()),
      "C" -> makeRandomPrintableString(50),
      "D" -> (if (random.nextBoolean()) { "Cd" } else { "Cc" }),
      "E" -> makeRandomPrintableString(15),
      "F" -> makeRandomPrintableString(15),
      "G" -> scala.math.abs(random.nextInt()),
      "H" -> random.nextBoolean(),
      "I" -> (if (random.nextBoolean()) { 41 } else { 31 }),
      "J" -> (if (random.nextBoolean()) { "A" } else { "B" }),
      "K" -> random.nextFloat(),
      "L" -> makeRandomPrintableString(15),
      "M" -> makeRandomPrintableString(15),
      "N" -> scala.math.abs(random.nextInt()),
      "O" -> random.nextFloat(),
      "P" -> (if (random.nextBoolean()) { "USD" } else { "GBP" }),
      "Q" -> (if (random.nextBoolean()) { "PROCESSED" } else { "UNPROCESSED" }),
      "R" -> scala.math.abs(random.nextInt())
    )
  }

  def makeRandomPrintableString(length: Int): String = {
    var result = ""
    for (i <- 0 until length) {
      result += random.nextPrintableChar().toString()
    }
    result
  }
}
And here's the output from my script:
Creating indexes: A, G, E, F
Adding 100000000 records to DB.
100000: 4.456 seconds (Thu Jul 21 15:18:57 EDT 2011)
200000: 4.155 seconds (Thu Jul 21 15:19:01 EDT 2011)
300000: 4.284 seconds (Thu Jul 21 15:19:05 EDT 2011)
400000: 4.32 seconds (Thu Jul 21 15:19:10 EDT 2011)
500000: 4.597 seconds (Thu Jul 21 15:19:14 EDT 2011)
600000: 4.412 seconds (Thu Jul 21 15:19:19 EDT 2011)
700000: 4.435 seconds (Thu Jul 21 15:19:23 EDT 2011)
800000: 5.919 seconds (Thu Jul 21 15:19:29 EDT 2011)
900000: 4.517 seconds (Thu Jul 21 15:19:33 EDT 2011)
1000000: 4.483 seconds (Thu Jul 21 15:19:38 EDT 2011)
1100000: 4.78 seconds (Thu Jul 21 15:19:43 EDT 2011)
1200000: 9.643 seconds (Thu Jul 21 15:19:52 EDT 2011)
1300000: 25.479 seconds (Thu Jul 21 15:20:18 EDT 2011)
1400000: 30.028 seconds (Thu Jul 21 15:20:48 EDT 2011)
1500000: 24.531 seconds (Thu Jul 21 15:21:12 EDT 2011)
1600000: 18.562 seconds (Thu Jul 21 15:21:31 EDT 2011)
1700000: 28.48 seconds (Thu Jul 21 15:21:59 EDT 2011)
1800000: 29.127 seconds (Thu Jul 21 15:22:29 EDT 2011)
1900000: 25.814 seconds (Thu Jul 21 15:22:54 EDT 2011)
2000000: 16.658 seconds (Thu Jul 21 15:23:11 EDT 2011)
2100000: 24.564 seconds (Thu Jul 21 15:23:36 EDT 2011)
2200000: 32.542 seconds (Thu Jul 21 15:24:08 EDT 2011)
2300000: 30.378 seconds (Thu Jul 21 15:24:39 EDT 2011)
2400000: 21.188 seconds (Thu Jul 21 15:25:00 EDT 2011)
2500000: 23.923 seconds (Thu Jul 21 15:25:24 EDT 2011)
2600000: 46.077 seconds (Thu Jul 21 15:26:10 EDT 2011)
2700000: 104.434 seconds (Thu Jul 21 15:27:54 EDT 2011)
2800000: 23.344 seconds (Thu Jul 21 15:28:17 EDT 2011)
2900000: 17.206 seconds (Thu Jul 21 15:28:35 EDT 2011)
3000000: 19.15 seconds (Thu Jul 21 15:28:54 EDT 2011)
3100000: 14.488 seconds (Thu Jul 21 15:29:08 EDT 2011)
3200000: 20.916 seconds (Thu Jul 21 15:29:29 EDT 2011)
3300000: 69.93 seconds (Thu Jul 21 15:30:39 EDT 2011)
3400000: 81.178 seconds (Thu Jul 21 15:32:00 EDT 2011)
3500000: 93.058 seconds (Thu Jul 21 15:33:33 EDT 2011)
3600000: 168.613 seconds (Thu Jul 21 15:36:22 EDT 2011)
3700000: 189.917 seconds (Thu Jul 21 15:39:32 EDT 2011)
3800000: 200.971 seconds (Thu Jul 21 15:42:53 EDT 2011)
3900000: 207.728 seconds (Thu Jul 21 15:46:21 EDT 2011)
4000000: 213.778 seconds (Thu Jul 21 15:49:54 EDT 2011)
4100000: 219.32 seconds (Thu Jul 21 15:53:34 EDT 2011)
4200000: 241.545 seconds (Thu Jul 21 15:57:35 EDT 2011)
4300000: 193.555 seconds (Thu Jul 21 16:00:49 EDT 2011)
4400000: 190.949 seconds (Thu Jul 21 16:04:00 EDT 2011)
4500000: 184.433 seconds (Thu Jul 21 16:07:04 EDT 2011)
4600000: 231.709 seconds (Thu Jul 21 16:10:56 EDT 2011)
4700000: 243.0 seconds (Thu Jul 21 16:14:59 EDT 2011)
4800000: 310.156 seconds (Thu Jul 21 16:20:09 EDT 2011)
4900000: 318.421 seconds (Thu Jul 21 16:25:28 EDT 2011)
5000000: 378.112 seconds (Thu Jul 21 16:31:46 EDT 2011)
5100000: 265.648 seconds (Thu Jul 21 16:36:11 EDT 2011)
5200000: 295.086 seconds (Thu Jul 21 16:41:06 EDT 2011)
5300000: 297.678 seconds (Thu Jul 21 16:46:04 EDT 2011)
5400000: 329.256 seconds (Thu Jul 21 16:51:33 EDT 2011)
5500000: 336.571 seconds (Thu Jul 21 16:57:10 EDT 2011)
5600000: 398.64 seconds (Thu Jul 21 17:03:49 EDT 2011)
5700000: 351.158 seconds (Thu Jul 21 17:09:40 EDT 2011)
5800000: 410.561 seconds (Thu Jul 21 17:16:30 EDT 2011)
5900000: 689.369 seconds (Thu Jul 21 17:28:00 EDT 2011)
Answer 0 (score: 51)

A few tips:

1. Don't index your collection before inserting. Every insert also has to modify the indexes, which is pure overhead. Insert everything first, then create the indexes.

2. Instead of "save", use MongoDB's batch insert, which can insert many records in a single operation. Insert around 5,000 documents per batch and you will see a remarkable performance gain. See method #2 of inserting here, which takes an array of documents to insert instead of a single document; also see the discussion in this thread if you want more benchmarks. A rough sketch of this and the other two tips follows the list.

3. This is just a guess, but try using a capped collection with a predefined large size to store all your data. A capped collection without indexes has very good insert performance.
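A minimal sketch of these tips against the question's own script, not a drop-in fix: it assumes Casbah's MongoCollection.insert accepts multiple documents per call (it wraps the Java driver's bulk insert) and that db.createCollection is exposed as in the Java driver. populateBatched, BATCH_SIZE, and the 1 GB cap are illustrative names and numbers, not recommendations.

    // Tip 3 (optional): create a capped collection of a fixed byte size
    // *before* loading. "size" is in bytes; 1 GB here is a placeholder.
    db.createCollection("testData",
      MongoDBObject("capped" -> true, "size" -> 1024L * 1024 * 1024))

    // Tip 2: build documents in memory and send them in batches of ~5000,
    // turning 5000 network round trips into one.
    val BATCH_SIZE = 5000

    def populateBatched(count: Int) {
      for (batchStart <- 0 until count by BATCH_SIZE) {
        val n = math.min(BATCH_SIZE, count - batchStart)
        val batch = (0 until n).map(_ => makeCoac())
        collection.insert(batch: _*)
      }
      // Tip 1: only now, after the load, create the secondary indexes.
      INDEX_KEYS.foreach(key => collection.ensureIndex(MongoDBObject(key -> 1)))
    }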
Answer 1 (score: 6)
I've had the same thing happen. As far as I can tell, it comes down to the randomness of the index values. Whenever a new document is inserted, all the underlying indexes obviously need to be updated as well. Because you're inserting random values (rather than sequential ones) into those indexes, you're constantly accessing the entire index to find where the new value belongs.

This is all fine while the indexes sit happily in memory, but once they grow too large to fit, you have to start hitting disk to do the index inserts; then the disk starts thrashing and write performance dies.

As you load the data, try comparing db.collection.totalIndexSize() with your available memory; you'll probably see exactly that happen.

Your best bet is to create the indexes after you've loaded the data. However, that still doesn't solve the problem when the culprit is the mandatory _id index holding random values (GUIDs, hashes, etc.); in that case your best approach is probably to think about sharding or getting more RAM.
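If you'd rather watch this from the load script itself than from a mongo shell, something like the following sketch should work. It assumes Casbah exposes raw server commands via db.command and relies on the standard collStats command, whose result includes totalIndexSize in bytes:

    // Print the collection's total index size; calling this alongside the
    // per-batch timing output shows when the indexes outgrow available RAM.
    def printIndexSize() {
      val stats = db.command(MongoDBObject("collStats" -> "testData"))
      println("totalIndexSize: " + stats.get("totalIndexSize") + " bytes")
    }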
Answer 2 (score: 4)

What I did in my project was to add some multithreading (the project is in C#, but I hope the code is self-explanatory). After experimenting with the number of threads, it turned out that setting the thread count to the number of cores gives slightly better performance (10-20%), though I suspect the boost is hardware-specific. Here is the code:
public virtual void SaveBatch(IEnumerable<object> entities)
{
    if (entities == null)
        throw new ArgumentNullException("entities");

    _repository.SaveBatch(entities);
}

public void ParallelSaveBatch(IEnumerable<IEnumerable<object>> batchPortions)
{
    if (batchPortions == null)
        throw new ArgumentNullException("batchPortions");

    var po = new ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };
    Parallel.ForEach(batchPortions, po, SaveBatch);
}
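For the Scala script in the question, a rough equivalent of the same idea, as a sketch only: split the inserts into batch-sized chunks and feed them to Scala's parallel collections, which default to one worker per core (matching the ParallelOptions above). It assumes the batched-insert approach and makeCoac helper sketched earlier; note that the shared scala.util.Random is synchronized internally, so it works across threads but can become a contention point.

    // Insert batch-sized chunks from a pool of worker threads.
    def parallelPopulate(count: Int, batchSize: Int = 5000) {
      val batches = (0 until count).grouped(batchSize).toSeq
      batches.par.foreach { chunk =>
        val docs = chunk.map(_ => makeCoac())
        collection.insert(docs: _*)
      }
    }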
Answer 3 (score: 0)

Another option is to try TokuMX. It uses fractal tree indexes, which means it does not slow down over time as the database gets bigger.

TokuMX is going to be included as a custom storage driver in an upcoming release of MongoDB.

The current version runs under Linux; I was up and running on Windows pretty quickly using Vagrant.