I have a scraper that dumps data into MongoDB, and another module that tries to retrieve that data from MongoDB, but according to the line profiler it is very slow.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
607 @profile
608 def get_item_iterator(self):
609 """
610 build a generator from the item collection
611 """
612 1 1 1.0 0.4 query = {'token': self.token}
613 # for item in self.collection.find(query):
614 # yield item
615 # return (item for item in self.collection.find(query))
616 1 263 263.0 98.9 items_cur=self.collection.find(query)
617 1 2 2.0 0.8 return items_cur
Total time: 0.168562 s
File: optim_id.py
Function: Identify at line 618
Line # Hits Time Per Hit % Time Line Contents
==============================================================
618 @profile
619 def Identify(self):
620 """
621 identify CTAs
622 """
623 1 2 2.0 0.0 try:
624 1 1 1.0 0.0 flag=0
625 1 280 280.0 0.2 items_cur=self.get_item_iterator()
626 112 158137 1411.9 93.8 for item in items_cur:
627 111 218 2.0 0.1 if flag==0:
So as you can see, the per-hit times are huge. How can I significantly reduce them? I have heard that list comprehensions are faster than loops, and I tried that as well, but without any success.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
607 @profile
608 def get_item_iterator(self):
609 """
610 build a generator from the item collection
611 """
612 1 2 2.0 0.6 query = {'token': self.token}
613 # for item in self.collection.find(query):
614 # yield item
615 1 310 310.0 99.4 return (item for item in self.collection.find(query))
Total time: 0.150235 s
File: optim_id.py
Function: Identify at line 616
Line # Hits Time Per Hit % Time Line Contents
==============================================================
616 @profile
617 def Identify(self):
618 """
619 identify CTAs
620 """
621 1 2 2.0 0.0 try:
622 1 328 328.0 0.2 item_list=self.get_item_iterator()
623 1 139896 139896.0 93.1 item_record=item_list.next()
My MongoDB stats:
db.stats()
{
"db" : "scrapy_database",
"collections" : 102,
"objects" : 167007,
"avgObjSize" : 1091.1401797529445,
"dataSize" : 182228048,
"storageSize" : 310439936,
"numExtents" : 374,
"indexes" : 100,
"indexSize" : 6115648,
"fileSize" : 469762048,
"nsSizeMB" : 16,
"extentFreeList" : {
"num" : 4,
"totalSize" : 6029312
},
"dataFileVersion" : {
"major" : 4,
"minor" : 22
},
"ok" : 1
}
> collection=db['scraped_rawdata']
scrapy_database.scraped_rawdata
> collection.stats()
{
"ns" : "scrapy_database.scraped_rawdata",
"count" : 100451,
"size" : 121793232,
"avgObjSize" : 1212,
"numExtents" : 13,
"storageSize" : 168075264,
"lastExtentSize" : 46333952,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"totalIndexSize" : 3270400,
"indexSizes" : {
"_id_" : 3270400
},
"ok" : 1
}
But I am only querying 111 items in total:
> collection.find({"token":"9a9ec6086bb4a4a7ae8cd44b909b139930e561c3"}).count()
111
Answer (score: 2):

Try increasing the query's batch_size; it looks like each item requires a separate round trip to the database. Also add an index on the token field.
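
A minimal sketch of what that could look like with pymongo (the connection URL and the batch size of 1000 are assumptions for illustration, and process() is a placeholder for the per-item work; the database, collection, and token values are taken from the stats above):

from pymongo import MongoClient, ASCENDING

# Assumed connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["scrapy_database"]["scraped_rawdata"]

# Index the 'token' field so the query can seek instead of scanning
# the whole collection; this is a no-op if the index already exists.
collection.create_index([("token", ASCENDING)])

# Fetch documents in larger batches so iterating the cursor makes
# fewer network round trips; 1000 is an arbitrary example value.
query = {"token": "9a9ec6086bb4a4a7ae8cd44b909b139930e561c3"}
for item in collection.find(query).batch_size(1000):
    process(item)  # placeholder for whatever Identify() does per item

With the index in place, the 111 matching documents are found via an index seek rather than a scan of all 100,451 documents, and the larger batch size means the first call that pulls from the cursor fetches many documents in a single round trip.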