I have a scraper that dumps data into MongoDB, and another module that tries to retrieve that data from MongoDB, but according to the line profiler it is very slow.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
607 @profile
608 def get_item_iterator(self):
609 """
610 build a generator from the item collection
611 """
612 1 1 1.0 0.4 query = {'token': self.token}
613 # for item in self.collection.find(query):
614 # yield item
615 # return (item for item in self.collection.find(query))
616 1 263 263.0 98.9 items_cur=self.collection.find(query)
617 1 2 2.0 0.8 return items_cur
Total time: 0.168562 s
File: optim_id.py
Function: Identify at line 618
Line # Hits Time Per Hit % Time Line Contents
==============================================================
618 @profile
619 def Identify(self):
620 """
621 identify CTAs
622 """
623 1 2 2.0 0.0 try:
624 1 1 1.0 0.0 flag=0
625 1 280 280.0 0.2 items_cur=self.get_item_iterator()
626 112 158137 1411.9 93.8 for item in items_cur:
627 111 218 2.0 0.1 if flag==0:
So as you can see, the per-hit times are huge. How can I significantly reduce them? I have heard that list comprehensions are faster than loops, and I tried that as well, but without any success.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
607 @profile
608 def get_item_iterator(self):
609 """
610 build a generator from the item collection
611 """
612 1 2 2.0 0.6 query = {'token': self.token}
613 # for item in self.collection.find(query):
614 # yield item
615 1 310 310.0 99.4 return (item for item in self.collection.find(query))
Total time: 0.150235 s
File: optim_id.py
Function: Identify at line 616
Line # Hits Time Per Hit % Time Line Contents
==============================================================
616 @profile
617 def Identify(self):
618 """
619 identify CTAs
620 """
621 1 2 2.0 0.0 try:
622 1 328 328.0 0.2 item_list=self.get_item_iterator()
623 1 139896 139896.0 93.1 item_record=item_list.next()
My MongoDB stats:
db.stats()
{
"db" : "scrapy_database",
"collections" : 102,
"objects" : 167007,
"avgObjSize" : 1091.1401797529445,
"dataSize" : 182228048,
"storageSize" : 310439936,
"numExtents" : 374,
"indexes" : 100,
"indexSize" : 6115648,
"fileSize" : 469762048,
"nsSizeMB" : 16,
"extentFreeList" : {
"num" : 4,
"totalSize" : 6029312
},
"dataFileVersion" : {
"major" : 4,
"minor" : 22
},
"ok" : 1
}
> collection=db['scraped_rawdata']
scrapy_database.scraped_rawdata
> collection.stats()
{
"ns" : "scrapy_database.scraped_rawdata",
"count" : 100451,
"size" : 121793232,
"avgObjSize" : 1212,
"numExtents" : 13,
"storageSize" : 168075264,
"lastExtentSize" : 46333952,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"totalIndexSize" : 3270400,
"indexSizes" : {
"_id_" : 3270400
},
"ok" : 1
}
But I am only querying 111 items in total:
> collection.find({"token":"9a9ec6086bb4a4a7ae8cd44b909b139930e561c3"}).count()
111
Answer (score: 2):

Try increasing the query's batch_size; it looks like each item requires a separate round trip to the database. Also add an index on the token field.
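
A minimal sketch of what that could look like with pymongo (the connection URL and the batch size of 1000 are assumptions for illustration, and process() is a placeholder for the per-item work; the database, collection, and token values are taken from the stats above):

from pymongo import MongoClient, ASCENDING

# Assumed connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["scrapy_database"]["scraped_rawdata"]

# Index the 'token' field so the query can seek instead of scanning
# the whole collection; this is a no-op if the index already exists.
collection.create_index([("token", ASCENDING)])

# Fetch documents in larger batches so iterating the cursor makes
# fewer network round trips; 1000 is an arbitrary example value.
query = {"token": "9a9ec6086bb4a4a7ae8cd44b909b139930e561c3"}
for item in collection.find(query).batch_size(1000):
    process(item)  # placeholder for whatever Identify() does per item

With the index in place, the 111 matching documents are found via an index seek rather than a scan of all 100,451 documents, and the larger batch size means the first call that pulls from the cursor fetches many documents in a single round trip.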