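#Problem#
When parsing a few million emails, the method Mail.read_from_string(mail_as_string) is too slow.
#Question#
How can I speed up email parsing?
#Context#
I have provided enough context below for you to understand my use case.
Receiving the email
I connect to an external IMAP server via Ruby's Net::IMAP.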
@imap = Net::IMAP.new("imap.gmail.com", 993, true) # A few login steps are omitted here
I fetch the email:
email = @imap.uid_fetch("85113", ["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]", "RFC822"]) # => #<struct Net::IMAP::FetchData seqno=55395, attr={"UID"=>85113, "RFC822"=>"Delivered-To: my@email.com\r\nReceived: by 10.223.148.78 with SMTP id o14csp218630fav;\r\n Tue, 18 Dec 2012 16:55:50 -0800 (PST)\r\nX-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414;\r\n Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReturn-Path: <noreply@128secure.net>\r\nReceived: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162])\r\n by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47;\r\n Tue, 18 Dec 2012 16:55:48 -0800 (PST)\r\nReceived-SPF: pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) smtp.mail=noreply@128secure.net\r\nReceived: by exproxy-1.exserver.dk (Postfix, from userid 65534)\r\n\tid 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62])\r\n\tby exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E\r\n\tfor <my_email.com@exfwd01.scannet.dk>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from front07.exserver.dk (195.69.129.92) by\r\n EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id\r\n 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100\r\nReceived: from localhost (front07.exserver.dk [127.0.0.1])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015\tfor\r\n <my@email.com>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)\r\nX-Virus-Scanned: amavisd-new at exserver.dk\r\nReceived: from front07.exserver.dk ([127.0.0.1])\tby localhost\r\n (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024)\twith ESMTP id\r\n vrjzzlpsuXn6 for <my@email.com>;\tWed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from 
shopmail.scannet.dk (shopmail.scannet.dk [195.69.129.120])\tby\r\n front07.exserver.dk (Postfix) with ESMTP id A6F797B4002\tfor\r\n <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)\r\nReceived: from WebSrv100 (unknown [193.239.97.100])\tby shopmail.scannet.dk\r\n (Postfix) with ESMTP id 6DFEF7FE4E\tfor <my@email.com>; Wed, 19 Dec\r\n 2012 01:55:34 +0100 (CET)\r\nMIME-Version: 1.0\r\nFrom: me <noreply@128secure.net>\r\nTo: me <my@email.com>\r\nReply-To: <my@email.com>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\nContent-Type: text/html; charset=\"utf-8\"\r\nContent-Transfer-Encoding: base64\r\nMessage-ID: <20121219005542.A6F797B4002@front07.exserver.dk>\r\nX-ScanNet-Forward: TTL=5\r\n\r\n\r\nT3JkcmUgZnJhIEdhbWVQSU1QOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+DQpPcmRy\r\nZWRhdG86IDE5LTEyLTIwMTIgMDE6NTU6NDM8YnI+DQpPcmRyZW51bW1lcjogMTA4NjU0PGJy\r\nPg0KVHJhbnNha3Rpb25zSUQ6IDE2NzI4Ng0KPGJyPjxicj4NCkZha3R1cmVyaW5nc2FkcmVz\r\nc2U6PGJyPg0KLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLTxicj48YnI+DQpMYXJzIFBldGVyc2VuPGJy\r\nIC8+QnlzdMOmdm5ldmVqIDY2LCBCw7hqZGVuPGJyIC8+NTYwMCBGYWFib3JnPGJyIC8+RGVu\r\nbWFyazxiciAvPlRMRjo6IDYwNjczNzY3PGJyIC8+PGEgaHJlZj0ibWFpbHRvOmZhc3RodWdv\r\nQGhvdG1haWwuY29tIj5mYXN0aHVnb0Bob3RtYWlsLmNvbTwvYT48YnIgLz4NCjxicj48YnI+\r\nDQpMZXZlcmluZ3NhZHJlc3NlOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08YnI+PGJyPg0K\r\nTGFycyBQZXRlcnNlbjxiciAvPkJ5c3TDpnZuZXZlaiA2NiwgQsO4amRlbjxiciAvPjU2MDAg\r\nRmFhYm9yZzxiciAvPkRlbm1hcms8YnIgLz5UTEY6OiA2MDY3Mzc2NzxiciAvPjxhIGhyZWY9\r\nIm1haWx0bzpmYXN0aHVnb0Bob3RtYWlsLmNvbSI+ZmFzdGh1Z29AaG90bWFpbC5jb208L2E+\r\nPGJyIC8+DQo8YnI+PGJyPg0KT3JkcmVkYXRhOjxicj4NCi0tLS0tLS0tLS0tLS0tLS0tLS0t\r\nLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS08\r\nYnI+DQoNCiAgMSwwMCBzdGsuIFgzOiBUZXJyYW4gQ29uZmxpY3QgUEM
gKDMxNDMyKSDDoSBE\r\nS0sgMjk4LDM5IC0gSWFsdDogREtLIDM3Miw5OQ0KPGJyPg0KICAxLDAwIHN0ay4gQ3J5c2lz\r\nIE1heGltdW0gRWRpdGlvbiBQQyAoNDgwNDgpIMOhIERLSyAxNTcsNTkgLSBJYWx0OiBES0sg\r\nMTk2LDk5DQo8YnI+DQo8YnI+DQpCZXRhbGluZzogMjogRGFuc2tlIGtyZWRpdGtvcnQgW3Ry\r\nYW5zYWt0aW9uc2dlYnlyIDEsMjUlXSAoREtLIDcsMTMpDQo8YnI+DQpGb3JzZW5kZWxzZTog\r\nIChES0sgMCwwMCkNCjxicj48YnI+DQpTYW1sZXQgcHJpcyA6IERLSyA1NzcsMTENCjxicj4N\r\nCkhlcmFmIG1vbXM6IERLSyAxMTUsNDMNCg==\r\n\r\n", "BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"=>"From: me <noreply@128secure.net>\r\nTo: me <my@email.com>\r\nDate: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\n\r\n"}>
Getting the email header information
header_attr = email.attr["BODY[HEADER.FIELDS (FROM TO DATE SUBJECT)]"]
header = Mail.read_from_string(header_attr) # => #<Mail::Message:70179653144480, Multipart: false, Headers: <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <noreply@128secure.net>>, <To: me <my@email.com>>, <Subject: Ordre (Kopi)>>
I can then access the following:
header.date.to_time # => 2012-12-18 16:55:44 -0800
header.from.first # => noreply@128secure.net
header.to.first # => my@email.com
header.subject # => Ordre (Kopi)
Delay 1: parsing the header
This takes 0.010000 seconds of CPU time (0.004163 seconds wall-clock):
puts Benchmark.measure { Mail.read_from_string(header_attr) } # => 0.010000 0.000000 0.010000 ( 0.004163)
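If only these four fields are needed, one option is to skip Mail for the header step and split the raw header block by hand. The following is a minimal sketch of that idea, not something the Mail gem provides: `quick_headers` and `bare_address` are hypothetical helper names, and the code assumes plain ASCII header values (no RFC 2047 encoded words) and a single address per From/To field.

```ruby
require "time"

# Hypothetical helper: extracts headers from a raw header block without
# building a Mail::Message. Assumes plain ASCII values (no encoded words).
def quick_headers(raw)
  # Unfold continuation lines (a line break followed by space or tab).
  unfolded = raw.gsub(/\r?\n[ \t]+/, " ")
  fields = {}
  unfolded.each_line do |line|
    name, value = line.chomp.split(":", 2)
    next if value.nil? # blank line or malformed line
    # Keep the first occurrence of each header name.
    fields[name.downcase] ||= value.strip
  end
  fields
end

# Pulls the bare address out of "Name <addr>", or returns the value unchanged.
def bare_address(value)
  value[/<([^>]+)>/, 1] || value
end

raw = "From: me <noreply@128secure.net>\r\nTo: me <my@email.com>\r\n" \
      "Date: Wed, 19 Dec 2012 01:55:44 +0100\r\nSubject: Ordre (Kopi)\r\n\r\n"

h       = quick_headers(raw)
from    = bare_address(h["from"])
to      = bare_address(h["to"])
date    = Time.rfc2822(h["date"])
subject = h["subject"]
```

Whether this is actually faster than Mail.read_from_string on your data should be measured; it also gives up the robustness Mail has for unusual headers.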
Getting the email (body)
message_attr = email.attr["RFC822"]
message = Mail.read_from_string(message_attr) # => #<Mail::Message:70179643743140, Multipart: false, Headers: <Return-Path: <noreply@128secure.net>>, <Received: by 10.223.148.78 with SMTP id o14csp218630fav; Tue, 18 Dec 2012 16:55:50 -0800 (PST)>, <Received: from exproxy-1.exserver.dk (exproxy-1.exserver.dk. [195.69.129.162]) by mx.google.com with ESMTP id m13si17440569wie.32.2012.12.18.16.55.47; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received: by exproxy-1.exserver.dk (Postfix, from userid 65534) id 5330511CDCB; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from EXHUB02.exchangeserver.dk (exhub02.exchangeserver.dk [193.239.98.62]) by exproxy-1.exserver.dk (Postfix) with ESMTP id 4735211A58E for <my_email.com@exfwd01.scannet.dk>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from front07.exserver.dk (195.69.129.92) by EXHUB02.exchangeserver.dk (193.239.98.60) with Microsoft SMTP Server id 8.2.176.0; Wed, 19 Dec 2012 01:58:49 +0100>, <Received: from localhost (front07.exserver.dk [127.0.0.1]) by front07.exserver.dk (Postfix) with ESMTP id 0B8287B4015 for <my@email.com>; Wed, 19 Dec 2012 01:55:45 +0100 (CET)>, <Received: from front07.exserver.dk ([127.0.0.1]) by localhost (front07.exserver.dk [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id vrjzzlpsuXn6 for <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from shopmail.scannet.dk (shopmail.scannet.dk [195.69.129.120]) by front07.exserver.dk (Postfix) with ESMTP id A6F797B4002 for <my@email.com>; Wed, 19 Dec 2012 01:55:42 +0100 (CET)>, <Received: from WebSrv100 (unknown [193.239.97.100]) by shopmail.scannet.dk (Postfix) with ESMTP id 6DFEF7FE4E for <my@email.com>; Wed, 19 Dec 2012 01:55:34 +0100 (CET)>, <Date: Wed, 19 Dec 2012 01:55:44 +0100>, <From: me <noreply@128secure.net>>, <Reply-To: <my@email.com>>, <To: me <my@email.com>>, <Message-ID: <20121219005542.A6F797B4002@front07.exserver.dk>>, <Subject: Ordre (Kopi)>, <Mime-Version: 1.0>, <Content-Type: text/html; charset="utf-8">, 
<Content-Transfer-Encoding: base64>, <Delivered-To: my@email.com>, <X-Received: by 10.194.177.199 with SMTP id cs7mr8044338wjc.41.1355878548414; Tue, 18 Dec 2012 16:55:48 -0800 (PST)>, <Received-SPF: pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) client-ip=195.69.129.162;>, <Authentication-Results: mx.google.com; spf=pass (google.com: domain of noreply@128secure.net designates 195.69.129.162 as permitted sender) smtp.mail=noreply@128secure.net>, <X-Virus-Scanned: amavisd-new at exserver.dk>, <X-ScanNet-Forward: TTL=5>>
To ensure UTF-8 encoding, I do the following:
if message.multipart?
body = message.text_part.decoded.force_encoding("UTF-8").encode("UTF-8")
else
body = message.body.decoded.force_encoding(message.charset).encode("UTF-8") # => "Ordre fra mig:<br>\r\n-------------------------------------------------------------------------<br>\r\nOrdredato: 19-12-2012 01:55:43<br>\r\nOrdrenummer: 108654<br>\r\nTransaktionsID: 167286\r\n<br><br>\r\nFaktureringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66, Bøjden<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:din@email.com\">din@email.com</a><br />\r\n<br><br>\r\nLeveringsadresse:<br>\r\n-------------------------------------------------------------------------<br><br>\r\nLars Larsen<br />En vej 66<br />1900 Frederiksberg<br />Denmark<br />TLF:: 12345678<br /><a href=\"mailto:en@mail.com\">en@mail.com</a><br />\r\n<br><br>\r\nOrdredata:<br>\r\n-------------------------------------------------------------------------<br>\r\n\r\n 1,00 stk. X3: Terran Conflict PC (31432) á DKK 298,39 - Ialt: DKK 372,99\r\n<br>\r\n 1,00 stk. Crysis Maximum Edition PC (48048) á DKK 157,59 - Ialt: DKK 196,99\r\n<br>\r\n<br>\r\nBetaling: 2: Danske kreditkort [transaktionsgebyr 1,25%] (DKK 7,13)\r\n<br>\r\nForsendelse: (DKK 0,00)\r\n<br><br>\r\nSamlet pris : DKK 577,11\r\n<br>\r\nHeraf moms: DKK 115,43\r\n"
end
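The conversion above raises if `message.charset` is nil or names an encoding Ruby does not know, and it can raise on bytes that cannot be transcoded. A more defensive variant might look like the sketch below. `to_utf8` is a hypothetical helper name, and falling back to UTF-8 for missing or unknown charsets is an assumption, not something the question specifies.

```ruby
# Hypothetical helper: labels raw bytes with the declared charset and
# transcodes them to UTF-8, replacing anything that cannot be converted.
def to_utf8(raw, charset)
  source = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    # Assumption: treat a missing or unknown charset as UTF-8.
    Encoding::UTF_8
  end

  raw.dup
     .force_encoding(source)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
     .scrub # also replaces invalid bytes when the source was already UTF-8
end

latin1 = to_utf8("B\xF8jden".b, "iso-8859-1") # ISO-8859-1 bytes
utf8   = to_utf8("B\xC3\xB8jden".b, "utf-8")  # UTF-8 bytes
```

With a helper like this, the non-multipart branch could call `to_utf8(message.body.decoded, message.charset)` instead of chaining `force_encoding` and `encode` directly.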
Delay 2: parsing the message
This takes 0.050000 seconds of CPU time (0.054013 seconds wall-clock):
puts Benchmark.measure { Mail.read_from_string(message_attr) } # => 0.050000 0.000000 0.050000 ( 0.054013)
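A single Benchmark.measure call is a noisy basis for comparison: the CPU-time columns have coarse resolution, which is likely why the header run reports 0.010000 CPU seconds against 0.004163 wall-clock seconds. Averaging over many iterations gives a steadier per-call figure. The sketch below uses a hypothetical helper, `average_seconds`, and a stand-in workload; in the question's setting the block would contain Mail.read_from_string(message_attr).

```ruby
require "benchmark"

# Hypothetical helper: average wall-clock seconds per call of the given block.
def average_seconds(iterations)
  total = Benchmark.realtime { iterations.times { yield } }
  total / iterations
end

# Stand-in workload so the sketch runs without the Mail gem.
sample   = "Subject: Ordre (Kopi)\r\n\r\nbody"
per_call = average_seconds(1_000) { sample.split("\r\n\r\n", 2) }
puts format("%.6f s per call", per_call)
```

Running the same helper over a representative sample of real messages, rather than one message, also shows whether a few large emails dominate the total time.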
Answer 0 (score: 0)
If you have already parsed the email into fields...
header.date.to_time # => 2012-12-18 16:55:44 -0800
header.from.first # => noreply@128secure.net
header.to.first # => my@email.com
header.subject # => Ordre (Kopi)
...then why have Mail parse it all over again? Rather than calling Mail.read_from_string(message_attr), try something like this:
# Build the message from the fields that were already parsed.
message = Mail.new({to: header.to.first,
from: header.from.first,
subject: header.subject,
body: body })
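Note that `body` in the snippet above still has to come from somewhere. In the same spirit of avoiding a full parse, a single-part message such as the one in the question could be decoded by hand. This is a rough sketch under stated assumptions: `quick_body` is a hypothetical helper, it handles only non-multipart messages, and it returns raw bytes that still need to be converted to the declared charset.

```ruby
# Hypothetical helper for single-part messages only: splits the raw RFC822
# text at the first blank line and decodes the body according to the
# Content-Transfer-Encoding header. Multipart messages are not handled.
def quick_body(raw)
  head, body = raw.split(/\r?\n\r?\n/, 2)
  encoding = head.to_s[/^Content-Transfer-Encoding:\s*(\S+)/i, 1].to_s.downcase

  case encoding
  when "base64"           then body.to_s.unpack1("m") # lenient base64 decode
  when "quoted-printable" then body.to_s.unpack1("M")
  else body.to_s                                      # 7bit/8bit/binary: as-is
  end
end

raw = "Content-Type: text/html; charset=\"utf-8\"\r\n" \
      "Content-Transfer-Encoding: base64\r\n\r\nT3JkcmU=\r\n"
decoded = quick_body(raw)
```

For anything with attachments or nested parts, Mail's own parser remains the safer choice; this shortcut only pays off when most of the mailbox consists of simple single-part messages.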