MongoDB crashes on Map/Reduce

Date: 2014-08-28 04:00:40

Tags: mongodb

I have been using MongoDB as the primary store for 1.5 TB+ of data since last year. Everything worked fine until recently, when I decided to run a map-reduce job against a collection of 14,000,000 documents and my production instance went down. Here are the details:

My configuration

Ubuntu 12.04.5 LTS, MongoDB 2.6.4, LVM (2 HDDs, 1.5 TB+ each, 3 TB+ total), 24 GB RAM (almost all free)

The Mongo configuration is the default (except for the logpath and dbpath parameters).
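For context, the crashing job was an ordinary mapReduce command. The question does not show the actual map/reduce functions, collection name, or output target, so everything in the sketch below is a hypothetical placeholder; it only illustrates the shape of the command document a driver would send to mongod:

```python
# Hypothetical sketch: the question does not show the real map-reduce job.
# The collection name, functions, and output collection below are made up.
map_fn = "function() { emit(this.category, 1); }"            # assumed grouping key
reduce_fn = "function(key, values) { return Array.sum(values); }"

command = {
    "mapReduce": "events",                 # hypothetical 14M-document collection
    "map": map_fn,
    "reduce": reduce_fn,
    "out": {"replace": "events_counts"},   # write results to a new collection
}

print(sorted(command.keys()))
```

A driver such as pymongo would pass a document like this to `db.command(...)`; the crash described here happened server-side regardless of how the command was submitted.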

Mongo log:

    2014-08-28T07:33:41.147+0400 [DataFileSync] flushing mmaps took 16177ms  for 777 files
    2014-08-28T07:33:44.004+0400 [conn13]       M/R: (1/3) Emit Progress: 9920300
    2014-08-28T07:33:47.178+0400 [conn13]       M/R: (1/3) Emit Progress: 9928100
    2014-08-28T07:33:50.004+0400 [conn13]       M/R: (1/3) Emit Progress: 9967800
    2014-08-28T07:33:53.115+0400 [conn13]       M/R: (1/3) Emit Progress: 10007800
    2014-08-28T07:33:56.009+0400 [conn13]       M/R: (1/3) Emit Progress: 10048800
    2014-08-28T07:33:59.050+0400 [conn13]       M/R: (1/3) Emit Progress: 10091200
    2014-08-28T07:34:02.530+0400 [conn13]       M/R: (1/3) Emit Progress: 10102300
    2014-08-28T07:34:05.510+0400 [conn13]       M/R: (1/3) Emit Progress: 10102400
    2014-08-28T07:34:08.932+0400 [conn13] SEVERE: Invalid access at address: 0x7cc8b2fe70b4
    2014-08-28T07:34:08.983+0400 [conn13] SEVERE: Got signal: 7 (Bus error).
    Backtrace:0x11e6111 0x11e54ee 0x11e55df 0x7f5a7031ecb0 0xf29cad 0xf32f28 0xf32770 0x8b601f 0x8b693a 0x982885 0x988485 0x9966d8 0x9a3355 0xa2889a 0xa29ce2 0xa2bea6 0xd5dd6d 0xb9fe62 0xba1440 0x770aef 
     mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
     mongod() [0x11e54ee]
     mongod() [0x11e55df]
     /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5a7031ecb0]
     mongod(_ZN5mongo16NamespaceDetails5allocEPNS_10CollectionERKNS_10StringDataEi+0x1bd) [0xf29cad]
     mongod(_ZN5mongo19SimpleRecordStoreV111allocRecordEii+0x68) [0xf32f28]
     mongod(_ZN5mongo17RecordStoreV1Base12insertRecordEPKcii+0x60) [0xf32770]
     mongod(_ZN5mongo10Collection15_insertDocumentERKNS_7BSONObjEbPKNS_16PregeneratedKeysE+0x7f) [0x8b601f]
     mongod(_ZN5mongo10Collection14insertDocumentERKNS_7BSONObjEbPKNS_16PregeneratedKeysE+0x22a) [0x8b693a]
     mongod(_ZN5mongo2mr5State12_insertToIncERNS_7BSONObjE+0x85) [0x982885]
     mongod(_ZN5mongo2mr5State14reduceInMemoryEv+0x175) [0x988485]
     mongod(_ZN5mongo2mr5State35reduceAndSpillInMemoryStateIfNeededEv+0x148) [0x9966d8]
     mongod(_ZN5mongo2mr16MapReduceCommand3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xcc5) [0x9a3355]
     mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2889a]
     mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x1042) [0xa29ce2]
     mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c6) [0xa2bea6]
     mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x22ed) [0xd5dd6d]
     mongod() [0xb9fe62]
     mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x580) [0xba1440]
     mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9f) [0x770aef]
    

After my first map-reduce run crashed, I ran db.repairDatabase(), but the same crash happened again on the second map-reduce attempt (after the repair). Now I have no idea how to complete my m/r.

Any ideas, guys?

1 Answer:

Answer 0 (score: 3)

After investigating the problem, I figured out a few things:

As suggested in the comments, I looked at the MongoDB JIRA ticket SERVER-12849 and double-checked my logs.

/var/log/syslog says:


    kernel: [1349503.760215] ata6.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
    Aug 28 08:18:41 overlord kernel: [1349503.760253] ata6.00: irq_stat 0x40000008
    Aug 28 08:18:41 overlord kernel: [1349503.760281] ata6.00: failed command: READ FPDMA QUEUED
    Aug 28 08:18:41 overlord kernel: [1349503.760318] ata6.00: cmd 60/08:00:10:48:92/00:00:84:00:00/40 tag 0 ncq 4096 in
    Aug 28 08:18:41 overlord kernel: [1349503.760318]          res 41/40:08:10:48:92/00:00:84:00:00/00 Emask 0x409 (media error) 
    Aug 28 08:18:41 overlord kernel: [1349503.760411] ata6.00: status: { DRDY ERR }
    Aug 28 08:18:41 overlord kernel: [1349503.760437] ata6.00: error: { UNC }
    Aug 28 08:18:41 overlord kernel: [1349503.788325] ata6.00: configured for UDMA/133
    Aug 28 08:18:41 overlord kernel: [1349503.788340] sd 5:0:0:0: [sdb] Unhandled sense code
    Aug 28 08:18:41 overlord kernel: [1349503.788343] sd 5:0:0:0: [sdb]  
    Aug 28 08:18:41 overlord kernel: [1349503.788345] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Aug 28 08:18:41 overlord kernel: [1349503.788348] sd 5:0:0:0: [sdb]  
    Aug 28 08:18:41 overlord kernel: [1349503.788350] Sense Key : Medium Error [current] [descriptor]
    Aug 28 08:18:41 overlord kernel: [1349503.788353] Descriptor sense data with sense descriptors (in hex):
    Aug 28 08:18:41 overlord kernel: [1349503.788355]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
    Aug 28 08:18:41 overlord kernel: [1349503.788365]         84 92 48 10 
    Aug 28 08:18:41 overlord kernel: [1349503.788370] sd 5:0:0:0: [sdb]  
    Aug 28 08:18:41 overlord kernel: [1349503.788373] Add. Sense: Unrecovered read error - auto reallocate failed
    Aug 28 08:18:41 overlord kernel: [1349503.788376] sd 5:0:0:0: [sdb] CDB: 
    Aug 28 08:18:41 overlord kernel: [1349503.788377] Read(10): 28 00 84 92 48 10 00 00 08 00
    Aug 28 08:18:41 overlord kernel: [1349503.788387] end_request: I/O error, dev sdb, sector 2224179216
    Aug 28 08:18:41 overlord kernel: [1349503.788434] ata6: EH complete

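The failing sector number in the syslog line (`end_request: I/O error, dev sdb, sector 2224179216`) can be converted to a byte offset, assuming the standard 512-byte logical sector size, to see roughly where on the disk the bad area sits:

```python
# Locate the failing sector reported by the kernel
# ("end_request: I/O error, dev sdb, sector 2224179216").
SECTOR_SIZE = 512                  # assumed logical sector size in bytes
failing_sector = 2224179216

byte_offset = failing_sector * SECTOR_SIZE
tib_offset = byte_offset / 2**40   # offset in TiB

print(f"byte offset: {byte_offset}")           # 1138779758592
print(f"~{tib_offset:.2f} TiB into the disk")  # ~1.04 TiB
```

So the unreadable area sits roughly 1 TiB into /dev/sdb, well inside the space the 1.5 TB+ data files occupy.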
It looks like /dev/sdb is the culprit, so let's check its SMART status (as suggested in the JIRA ticket):

    SMART Error Log Version: 1
    ATA Error Count: 135 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 135 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 08 ff ff ff 4f 00  49d+12:01:35.512  WRITE FPDMA QUEUED
      60 00 08 ff ff ff 4f 00  49d+12:01:33.380  READ FPDMA QUEUED
      ea 00 00 00 00 00 a0 00  49d+12:01:33.294  FLUSH CACHE EXT
      61 00 00 ff ff ff 4f 00  49d+12:01:33.292  WRITE FPDMA QUEUED
      ea 00 00 00 00 00 a0 00  49d+12:01:33.153  FLUSH CACHE EXT

    Error 134 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 08 ff ff ff 4f 00  49d+11:17:00.189  WRITE FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  49d+11:17:00.189  WRITE FPDMA QUEUED
      61 00 28 ff ff ff 4f 00  49d+11:17:00.188  WRITE FPDMA QUEUED
      61 00 08 ff ff ff 4f 00  49d+11:17:00.188  WRITE FPDMA QUEUED
      61 00 08 ff ff ff 4f 00  49d+11:17:00.188  WRITE FPDMA QUEUED

    Error 133 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
      When the command that caused the error occurred, the device was active or idle.

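The SMART error log is plain text with a regular structure, so the error entries can be pulled out mechanically. A small sketch that extracts the error number and power-on hours from header lines like the ones above (the sample string is taken from the output shown):

```python
import re

# Count entries in a smartctl error log by scanning for the
# "Error N occurred at disk power-on lifetime: H hours" header lines.
smart_log = """\
Error 135 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
Error 134 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
Error 133 occurred at disk power-on lifetime: 11930 hours (497 days + 2 hours)
"""

pattern = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
errors = [(int(num), int(hours)) for num, hours in pattern.findall(smart_log)]

print(errors)  # [(135, 11930), (134, 11930), (133, 11930)]
```

All the recent errors cluster at the same power-on hour, consistent with one bad region being hit repeatedly.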
So we can see there are errors on /dev/sdb. Let's do a final check: copy all the data to another host and try to run the original map-reduce script there.

The result: success.

In my case, mongo itself was fine. When a (Bus error) signal entry shows up in the mongo log, it seems it's time to check your hardware.
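The distinctive line to watch for is `SEVERE: Got signal: 7 (Bus error).`; on Linux, signal 7 is SIGBUS, which typically points at a hardware or memory-mapped I/O problem rather than a bug in the query. A small sketch of a log check that flags such entries (the sample lines are taken from the mongod log above):

```python
import re

# Flag fatal-signal lines in a mongod log. Signal 7 (SIGBUS on Linux)
# usually indicates a hardware / mmap I/O failure, not a query bug.
sample_log = [
    "2014-08-28T07:34:05.510+0400 [conn13]  M/R: (1/3) Emit Progress: 10102400",
    "2014-08-28T07:34:08.932+0400 [conn13] SEVERE: Invalid access at address: 0x7cc8b2fe70b4",
    "2014-08-28T07:34:08.983+0400 [conn13] SEVERE: Got signal: 7 (Bus error).",
]

signal_re = re.compile(r"SEVERE: Got signal: (\d+) \((.+)\)")
hits = [(int(m.group(1)), m.group(2)) for line in sample_log
        if (m := signal_re.search(line))]

print(hits)  # [(7, 'Bus error')]
```

Running a check like this against the full log (and cross-referencing the timestamps with /var/log/syslog, as done above) is a quick way to tell a hardware fault apart from a mongod bug.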