Extracting a URL from an email body using Python?

Time: 2015-10-14 01:18:50

Tags: python regex python-2.7 parsing extraction

Given this raw email:

[('96 (RFC822 {17888}',
'Delivered-To: example@gmail.com\r\nReceived: by 10.182.129.229 with SMTP id nz5csp2388417obb;\r\n        Tue, 13 Oct 2015 14:57:14 -0700 (PDT)\r\nX-Received: by 10.68.136.103 with SMTP id pz7mr5507255pbb.114.1444773434163;\r\n        Tue, 13 Oct 2015 14:57:14 -0700 (PDT)\r\nReturn-Path: <t0721aa7a92-ed37dd57c-9df2edd3ab1d4c49a5c9ac3a0569baab@bounce.twitter.com>\r\nReceived: from spruce-goose-bc.twitter.com (spruce-goose-bc.twitter.com. [199.59.150.98])\r\n        by mx.google.com with ESMTPS id xm2si7949727pbb.66.2015.10.13.14.57.13\r\n        for <example@gmail.com>\r\n        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\r\n        Tue, 13 Oct 2015 14:57:14 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of t0721aa7a92-ed37dd57c-9df2edd3ab1d4c49a5c9ac3a0569baab@bounce.twitter.com designates 199.59.150.98 as permitted sender) client-ip=199.59.150.98;\r\nAuthentication-Results: mx.google.com;\r\n       spf=pass (google.com: domain of t0721aa7a92-ed37dd57c-9df2edd3ab1d4c49a5c9ac3a0569baab@bounce.twitter.com designates 199.59.150.98 as permitted sender) smtp.mailfrom=t0721aa7a92-ed37dd57c-9df2edd3ab1d4c49a5c9ac3a0569baab@bounce.twitter.com;\r\n       dkim=pass header.i=@twitter.com;\r\n       dmarc=pass (p=REJECT dis=NONE) header.from=twitter.com\r\nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=twitter.com;\r\n\ts=dkim-201406; t=1444773433;\r\n\tbh=WBJ/04fcxapn9W2moQ6bGL1p7salO/SDhe2f3COz1us=;\r\n\th=Date:From:To:Subject:MIME-Version:Content-Type:Message-ID;\r\n\tb=tvyrM/Sz+g0WemkLWTYoarsftOM0Y4jQAWCNdqRm6W+5kBG43CP2q6woxrtDqgYHg\r\n\t o/zPvMa5nIPjoOfslv0YCUlhfuVjr0V/6InNMl65s3/zGRMlCQxQjS+UGsQrF2zH6Z\r\n\t G7pWHMTml1NxI2r77nuOhSyhknNFCA9pl0SkeNfoyK8jcIo6rNS2uugFBw5Ta/fS8i\r\n\t RMXcNpLA35k4Znvboe2aiZQg7ZY6NjbtNT3X6Ln4xuAgLkjeS/BfDBvd6M8CZ8yIT8\r\n\t 7xStI8xTfT/zKqcK+35yqnAqQ3QD5oll/DWxQatFUIYzLsgw2DV39XRo11y6OTdDim\r\n\t KNS2DTEjaOsBg==\r\nX-MSFBL: 
eyJ1IjoiaW5nbGVzbWFuYWd1YUBnbWFpbC5jb21AMTRAMzgxNjkwOTc5M0AwQDJj\r\n\tMjQ4NDVjZTJjOGMyNjI0NDMxY2MzZDBlOGY3NTZhNDVjNGI4MzQiLCJnIjoiRXZl\r\n\tcnl0aGluZyIsImIiOiJzbWYxLWJkcC0yMy1zcjEtRXZlcnl0aGluZy4xOTgiLCJy\r\n\tIjoiaW5nbGVzbWFuYWd1YUBnbWFpbC5jb20ifQ==\r\nDate: Tue, 13 Oct 2015 21:57:13 +0000\r\nFrom: Twitter <confirm@twitter.com>\r\nTo: example <example@gmail.com>\r\nSubject: Confirm your Twitter account, example\r\nMIME-Version: 1.0\r\nContent-Type: multipart/alternative; \r\n\tboundary="----=_Part_44683898_1221426234.1444773433942"\r\nFeedback-ID: 16481b2a2bd9895bc6fbf92980687bb5fdd96d63782c26cd:16481b2a2bd9895bc6fbf92980687bb5fdd96d63782c26cd:none:twitterESP\r\nMessage-ID: <68.DA.14434.93E7D165@twitter.com>\r\n\r\n------=_Part_44683898_1221426234.1444773433942\r\nContent-Type: text/plain; charset=UTF-8\r\nContent-Transfer-Encoding: 7bit\r\n\r\nexample,\r\n\r\nConfirm your email address to complete your Twitter account. It\'s easy - just click on the button below.\r\n\r\nClick on the link below or copy and paste it into a browser:\r\n\r\nhttps://twitter.com/i/redirect?url=https%3A%2F%2Ftwitter.com%2Faccount%2Fconfirm_user_email%2F3816909793%2F9CE5D-H4F5D-144477%3Ft%3D1%26cn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26sig%3Da6878f323b83b61ceb5eaa8fbdb2214d25fc65e7%26al%3D1%26iid%3D9df2edd3ab1d4c49a5c9ac3a0569baab%26ac%3D1%26autoactions%3D1444773433%26uid%3D3816909793%26nid%3D14%2B309&amp;t=1&amp;cn=ZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=2b56e3a59dd6b182afaf3a0030a96b26ccc67d73&amp;iid=9df2edd3ab1d4c49a5c9ac3a0569baab&amp;uid=3816909793&amp;nid=14+309\r\n------=_Part_44683898_1221426234.1444773433942\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/htm=\r\nl4/strict.dtd">\r\n<html>\r\n<head>\r\n<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-8" />\r\n<meta name=3D"viewport" content=3D"width=3Ddevice-width, minimum-scale=3D1.=\r\n0, 
maximum-scale=3D1.0, user-scalable=3D0" />\r\n<meta name=3D"apple-mobile-web-app-capable" content=3D"yes" />\r\n<style type=3D"text/css">\r\n\r\n@media only screen and (max-device-width: 420px) {\r\ntd[class=3D"spacer"]{\r\nfont-size:4px !important;\r\n\r\n}\r\n\r\nspan[class=3D"address"] a {\r\n\r\nline-height:18px !important;\r\n}\r\n\r\n\r\ntd[class=3D"margins"]{\r\nwidth:18px !important;\r\n}\r\ntd[class=3D"logo_space"]{\r\nheight:12px !important;\r\n}\r\n}\r\n\r\n@media only screen and (max-device-width: 480px) {\r\n\r\ntable[class=3D"collapse"]{\r\nwidth:100% !important;\r\n}\r\n\r\ndiv[class=3D"collapse"]{\r\nwidth:100% !important;\r\n}\r\n\r\n\r\ntd[class=3D"body_text"] {\r\nfont-size:14px !important;\r\nline-height:22px !important;\r\n\r\n\r\n}\r\n\r\ntd[class=3D"greeting"]{\r\nfont-size:14px !important;\r\n\r\n}\r\n\r\n\r\ntd[class=3D"v_space"]{\r\nheight:8px !important;\r\n\r\n}\r\n\r\n\r\nspan[class=3D"address"]{\r\ndisplay:block !important;\r\nwidth:240px !important;\r\n}\r\ntd[class=3D"cut"]{\r\ndisplay:none !important;\r\n}\r\n\r\n}\r\n</style>\r\n</head>\r\n<body bgcolor=3D"#e1e8ed" style=3D"margin:0;padding:0;-webkit-text-size-adj=\r\nust:100%;-ms-text-size-adjust:100%;">\r\n<table cellpadding=3D"0" cellspacing=3D"0" border=3D"0" width=3D"100%" bgco=\r\nlor=3D"#e1e8ed" style=3D"background-color:#e1e8ed;padding:0;margin:0;line-h=\r\neight:1px;font-size:1px;" class=3D"body_wrapper">\r\n<tbody>\r\n<tr>\r\n<td align=3D"center" style=3D"padding:0;margin:0;line-height:1px;font-size:=\r\n1px;">\r\n<table class=3D"collapse" id=3D"header" align=3D"center" width=3D"500" styl=\r\ne=3D"width: 500px;padding:0;margin:0;line-height:1px;font-size:1px;" bgcolo=\r\nr=3D"#ffffff" cellpadding=3D"0" cellspacing=3D"0" border=3D"0">\r\n<tbody>\r\n<tr>\r\n<td style=3D"min-width: 500px;height:1px;padding:0;margin:0;line-height:1px=\r\n;font-size:1px;" class=3D"cut"> <img src=3D"https://ea.twimg.com/email/self=\r\n_serve/media/spacer-1402696023930.png" style=3D"min-width: 
500px;height:1px=\r\n;margin:0;padding:0;display:block;-ms-interpolation-mode:bicubic;border:non=\r\ne;outline:none;" /> </td>\r\n</tr>\r\n</tbody>\r\n</table> </td>\r\n</tr>\r\n<tr>\r\n<td align=3D"center" style=3D"padding:0;margin:0;line-height:1px;font-size:=\r\n1px;">\r\n<!--///////////////////header///////////////////////////-->\r\n<table class=3D"collapse" id=3D"header" align=3D"center" width=3D"500" styl=\r\ne=3D"width:500px;background-color:#ffffff;padding:0;margin:0;line-height:1p=\r\nx;font-size:1px;" bgcolor=3D"#ffffff" cellpadding=3D"0" cellspacing=3D"0" b=\r\norder=3D"0">\r\n<tbody>\r\n<tr>\r\n<td height=3D"15" style=3D"height:15px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;" class=3D"logo_space"> &nbsp; </td>\r\n</tr>\r\n<tr>\r\n<td style=3D"padding:0;margin:0;line-height:1px;font-size:1px;">\r\n<table cellpadding=3D"0" cellspacing=3D"0" border=3D"0" width=3D"100%" styl=\r\ne=3D"width:100%;padding:0;margin:0;line-height:1px;font-size:1px;">\r\n<tbody>\r\n<tr>\r\n<td align=3D"left" width=3D"15" style=3D"width:15px;padding:0;margin:0;line=\r\n-height:1px;font-size:1px;"></td>\r\n<td align=3D"left" width=3D"28" style=3D"padding:0;margin:0;line-height:1px=\r\n;font-size:1px;"> <a href=3D"https://twitter.com/i/redirect?url=3Dhttps%3A%=\r\n2F%2Ftwitter.com%3Fcn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26refsrc%3Demail&a=\r\nmp;t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=3Dfe1cdb1344cee3=\r\nb9db0674bd2ce2f22397f739d7&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a0569baab&amp;u=\r\nid=3D3816909793&amp;nid=3D14+21" style=3D"text-decoration:none;border-style=\r\n:none;border:0;padding:0;margin:0;"><img align=3D"left" width=3D"28" src=3D=\r\n"https://ea.twimg.com/email/self_serve/media/logo-1400528502322.png" style=\r\n=3D"width:28px;padding-bottom:2px;margin:0;padding:0;display:block;-ms-inte=\r\nrpolation-mode:bicubic;border:none;outline:none;" /></a> </td>\r\n<td align=3D"left" width=3D"10" 
style=3D"width:10px;padding:0;margin:0;line=\r\n-height:1px;font-size:1px;"></td>\r\n<td align=3D"left" class=3D"greeting" style=3D"padding:0;margin:0;line-heig=\r\nht:1px;font-size:1px;font-family:\'Helvetica Neue Light\', Helvetica, Arial, =\r\nsans-serif;-webkit-font-smoothing:antialiased;-webkit-text-size-adjust:none=\r\n;color:#66757f;font-size:16px;padding:0px;margin:0px;font-weight:300;line-h=\r\neight:100%;text-align:left;"> example, </td>\r\n</tr>\r\n</tbody>\r\n</table> </td>\r\n</tr>\r\n<tr>\r\n<td height=3D"14" style=3D"height:14px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;" class=3D"logo_space"> &nbsp; </td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n<!--////////////////////border//////////////////////////-->\r\n<table class=3D"collapse" align=3D"center" width=3D"500" style=3D"width:500=\r\npx;background-color:#ffffff;padding:0;margin:0;line-height:1px;font-size:1p=\r\nx;" cellpadding=3D"0" cellspacing=3D"0" border=3D"0">\r\n<tbody>\r\n<tr id=3D"border">\r\n<td colspan=3D"2" height=3D"1" style=3D"line-height:1px;display:block;heigh=\r\nt:1px;background-color:#e1e8ed;padding:0;margin:0;line-height:1px;font-size=\r\n:1px;"></td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n<!--//////////////////////////////////////////////-->\r\n<table class=3D"collapse" align=3D"center" width=3D"500" style=3D"width:500=\r\npx;background-color:#ffffff;padding:0;margin:0;line-height:1px;font-size:1p=\r\nx;" cellpadding=3D"0" cellspacing=3D"0" border=3D"0">\r\n<tbody>\r\n<tr>\r\n<td width=3D"50" style=3D"width:50px;padding:0;margin:0;line-height:1px;fon=\r\nt-size:1px;" class=3D"margins"></td>\r\n<td align=3D"center" style=3D"padding:0;margin:0;line-height:1px;font-size:=\r\n1px;">\r\n<table width=3D"100%" align=3D"center" cellpadding=3D"0" cellspacing=3D"0" =\r\nborder=3D"0" class=3D"collapse" style=3D"padding:0;margin:0;line-height:1px=\r\n;font-size:1px;">\r\n<tbody>\r\n<tr>\r\n<td height=3D"30" 
style=3D"height:30px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;"></td>\r\n</tr>\r\n<tr>\r\n<td align=3D"left" style=3D"padding:0;margin:0;line-height:1px;font-size:1p=\r\nx;"> <span class=3D"headline_1" style=3D"font-family:\'Helvetica Neue Light\'=\r\n, Helvetica, Arial, sans-serif;-webkit-font-smoothing:antialiased;-webkit-t=\r\next-size-adjust:none;color:#66757f;font-size:28px;padding:0px;margin:0px;fo=\r\nnt-weight:300;line-height:100%;text-align:left;">Final step...</span> </td>\r\n</tr>\r\n<tr>\r\n<td height=3D"12" style=3D"height:12px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;" class=3D"v_space"></td>\r\n</tr>\r\n<tr>\r\n<td align=3D"left" class=3D"body_text" style=3D"padding:0;margin:0;line-hei=\r\nght:1px;font-size:1px;font-family:\'Helvetica Neue Light\', Helvetica, Arial,=\r\n sans-serif;-webkit-font-smoothing:antialiased;-webkit-text-size-adjust:non=\r\ne;color:#66757f;font-size:16px;padding:0px;margin:0px;font-weight:300;line-=\r\nheight:23px;text-align:left;"> Confirm your email address to complete your =\r\nTwitter account. It\'s easy =E2=80=94 just click on the button below. 
</td>\r\n</tr>\r\n<!--*********** button ************-->\r\n<tr>\r\n<td height=3D"22" style=3D"height:22px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;"></td>\r\n</tr>\r\n<tr>\r\n<td align=3D"left" class=3D"button" style=3D"padding:0;margin:0;line-height=\r\n:1px;font-size:1px;">\r\n<table bgcolor=3D"#55acee" height=3D"40" border=3D"0" cellspacing=3D"0" cel=\r\nlpadding=3D"0" align=3D"left" style=3D"white-space:nowrap;border-radius:5px=\r\n;border-style:none;text-align:center;padding:0;margin:0;line-height:1px;fon=\r\nt-size:1px;">\r\n<tbody>\r\n<tr>\r\n<td class=3D"spacer" width=3D"30" style=3D"font-size:1px;font-size:1px;line=\r\n-height:1px;font-size:1px;padding:0;margin:0;line-height:1px;font-size:1px;=\r\n">&nbsp;</td>\r\n<td height=3D"40" align=3D"center" style=3D"padding:0;margin:0;line-height:=\r\n1px;font-size:1px;"> <a href=3D"https://twitter.com/i/redirect?url=3Dhttps%=\r\n3A%2F%2Ftwitter.com%2Faccount%2Fconfirm_user_email%2F3816909793%2F9CE5D-H4F=\r\n5D-144477%3Ft%3D1%26cn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26sig%3D69386bec1=\r\n102903b8e56a388d035a97f9d8e69f9%26al%3D1%26iid%3D9df2edd3ab1d4c49a5c9ac3a05=\r\n69baab%26ac%3D1%26autoactions%3D1444773433%26uid%3D3816909793%26nid%3D14%2B=\r\n308&amp;t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=3D256cbf355=\r\n6df8db1580c37c1e032d1178f4d23a3&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a0569baab&=\r\namp;uid=3D3816909793&amp;nid=3D14+308" style=3D"border-style:none;text-deco=\r\nration:none;color:#ffffff;-webkit-font-smoothing: antialiased;font-size:14p=\r\nx;letter-spacing:0.02em;font-weight:bold;white-space:nowrap;overflow:hidden=\r\n;padding:0px;margin:0px;font-family:\'Helvetica Neue\', Helvetica, Arial, san=\r\ns-serif;line-height:14px;text-decoration:none;border-style:none;border:0;pa=\r\ndding:0;margin:0;"> <span class=3D"" style=3D"border-style:none;text-decora=\r\ntion:none;color:#ffffff;line-height:100%">Confirm now</span> </a> </td>\r\n<td class=3D"spacer" width=3D"30" 
style=3D"font-size:1px;font-size:1px;line=\r\n-height:1px;font-size:1px;padding:0;margin:0;line-height:1px;font-size:1px;=\r\n">&nbsp;</td>\r\n</tr>\r\n</tbody>\r\n</table> </td>\r\n</tr>\r\n<!--*********** end button ************-->\r\n<tr>\r\n<td height=3D"44" style=3D"height:44px;padding:0;margin:0;line-height:1px;f=\r\nont-size:1px;"></td>\r\n</tr>\r\n</tbody>\r\n</table> </td>\r\n<td width=3D"50" style=3D"width:50px;padding:0;margin:0;line-height:1px;fon=\r\nt-size:1px;" class=3D"margins"></td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n<!--//////////////////////////////////////////////-->\r\n<table class=3D"collapse" id=3D"footer" align=3D"center" width=3D"500" styl=\r\ne=3D"width:500px;background-color:#ffffff;padding:0;margin:0;line-height:1p=\r\nx;font-size:1px;" cellpadding=3D"0" cellspacing=3D"0" border=3D"0">\r\n<tbody>\r\n<tr>\r\n<td height=3D"1" style=3D"line-height:1px;display:block;height:1px;backgrou=\r\nnd-color:#e1e8ed;padding:0;margin:0;line-height:1px;font-size:1px;"></td>\r\n</tr>\r\n<tr>\r\n<td height=3D"20" style=3D"height:20;padding:0;margin:0;line-height:1px;fon=\r\nt-size:1px;"></td>\r\n</tr>\r\n<tr>\r\n<td align=3D"center" style=3D"padding:0;margin:0;line-height:1px;font-size:=\r\n1px;"> <span class=3D"footer_type" style=3D"font-family:\'Helvetica Neue Lig=\r\nht\', Helvetica, Arial, sans-serif;-webkit-font-smoothing:antialiased;color:=\r\n#8899a6;font-size:12px;padding:0px;margin:0px;font-weight:normal;line-heigh=\r\nt:12px;"> <a 
href=3D"https://twitter.com/i/redirect?url=3Dhttps%3A%2F%2Ftwi=\r\ntter.com%2Fi%2Fredirect%3Furl%3Dhttps%253A%252F%252Ftwitter.com%252Fsetting=\r\ns%252Fnotifications%253Fcn%253DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26t%3D1%26c=\r\nn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26sig%3D3084a7eb53ea988c00b18e060fa6a6=\r\n023b0f5c36%26iid%3D9df2edd3ab1d4c49a5c9ac3a0569baab%26uid%3D3816909793%26ni=\r\nd%3D14%2B27&amp;t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=3Da=\r\n53a86b7487b15c908170e0d06203350ad2e0745&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a0=\r\n569baab&amp;uid=3D3816909793&amp;nid=3D14+1555" class=3D"footer_link" style=\r\n=3D"text-decoration:none;border-style:none;border:0;padding:0;margin:0;font=\r\n-family:\'Helvetica Neue Light\', Helvetica, Arial, sans-serif;-webkit-font-s=\r\nmoothing:antialiased;-webkit-text-size-adjust:none;color:#55acee;font-size:=\r\n12px;padding:0px;margin:0px;font-weight:600;line-height:12px;">Settings</a>=\r\n | <a href=3D"https://twitter.com/i/redirect?url=3Dhttps%3A%2F%2Fsupport.tw=\r\nitter.com%2F&amp;t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=3D=\r\n1dfdf7cecb06258c7e6a41ca318ec4370f621673&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a=\r\n0569baab&amp;uid=3D3816909793&amp;nid=3D14+1557" class=3D"footer_link" styl=\r\ne=3D"text-decoration:none;border-style:none;border:0;padding:0;margin:0;fon=\r\nt-family:\'Helvetica Neue Light\', Helvetica, Arial, sans-serif;-webkit-font-=\r\nsmoothing:antialiased;-webkit-text-size-adjust:none;color:#55acee;font-size=\r\n:12px;padding:0px;margin:0px;font-weight:600;line-height:12px;">Help</a> | =\r\n<a href=3D"https://twitter.com/i/u?t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX25vdGljZ=\r\nV9uZXh0&amp;sig=3D638d06973cb368d673778db5c8414b594d5c6ed2&amp;iid=3D9df2ed=\r\nd3ab1d4c49a5c9ac3a0569baab&amp;uid=3D3816909793&amp;nid=3D14+26" class=3D"f=\r\nooter_link" style=3D"text-decoration:none;border-style:none;border:0;paddin=\r\ng:0;margin:0;font-family:\'Helvetica Neue Light\', Helvetica, Arial, 
sans-ser=\r\nif;-webkit-font-smoothing:antialiased;-webkit-text-size-adjust:none;color:#=\r\n55acee;font-size:12px;padding:0px;margin:0px;font-weight:600;line-height:12=\r\npx;">Opt-out</a> | <a href=3D"https://twitter.com/i/redirect?url=3Dhttps%3A=\r\n%2F%2Ftwitter.com%2Faccount%2Fnot_my_account%2F3816909793%2F9CE5D-H4F5D-144=\r\n477%3Fut%3D1%26cn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;t=3D1&amp;cn=3DZW1=\r\nhaWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=3D0e2b07faf8b7cab119459e512ea58097f5b=\r\n8e82b&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a0569baab&amp;uid=3D3816909793&amp;n=\r\nid=3D14+25" class=3D"footer_link" style=3D"text-decoration:none;border-styl=\r\ne:none;border:0;padding:0;margin:0;font-family:\'Helvetica Neue Light\', Helv=\r\netica, Arial, sans-serif;-webkit-font-smoothing:antialiased;-webkit-text-si=\r\nze-adjust:none;color:#55acee;font-size:12px;padding:0px;margin:0px;font-wei=\r\nght:600;line-height:12px;">Not my account</a> </span> </td>\r\n</tr>\r\n<tr>\r\n<td height=3D"10" style=3D"height:10px;line-height:1px;font-size:1px;paddin=\r\ng:0;margin:0;line-height:1px;font-size:1px;"></td>\r\n</tr>\r\n<tr>\r\n<td align=3D"center" style=3D"padding:0;margin:0;line-height:1px;font-size:=\r\n1px;"> <span class=3D"address"> <a href=3D"" style=3D"text-decoration:none;=\r\nborder-style:none;border:0;padding:0;margin:0;font-family:\'Helvetica Neue L=\r\night\', Helvetica, Arial, sans-serif;-webkit-font-smoothing:antialiased;colo=\r\nr:#8899a6;font-size:12px;padding:0px;margin:0px;font-weight:normal;line-hei=\r\nght:12px;cursor:default;">Twitter, Inc. 
1355 Market Street, Suite 900 San F=\r\nrancisco, CA 94103</a> </span> </td>\r\n</tr>\r\n<tr>\r\n<td height=3D"26" style=3D"height:26;padding:0;margin:0;line-height:1px;fon=\r\nt-size:1px;"></td>\r\n</tr>\r\n</tbody>\r\n</table> <img width=3D"1" height=3D"1" style=3D"display: block;margin:0;pad=\r\nding:0;display:block;-ms-interpolation-mode:bicubic;border:none;outline:non=\r\ne;" src=3D"https://twitter.com/scribe/ibis?t=3D1&amp;cn=3DZW1haWxfY2hhbmdlX=\r\n25vdGljZV9uZXh0&amp;iid=3D9df2edd3ab1d4c49a5c9ac3a0569baab&amp;uid=3D381690=\r\n9793&amp;nid=3D14+20" />\r\n<!--//////////////////////////////////////////////--> </td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n</body>\r\n</html>\r\n\r\n------=_Part_44683898_1221426234.1444773433942--\r\n')

I am trying to extract the confirmation link that has to be clicked:

https://twitter.com/i/redirect?url=https%3A%2F%2Ftwitter.com%2Faccount%2Fconfirm_user_email%2F3816909793%2F9CE5D-H4F5D-144477%3Ft%3D1%26cn%3DZW1haWxfY2hhbmdlX25vdGljZV9uZXh0%26sig%3Da6878f323b83b61ceb5eaa8fbdb2214d25fc65ahdgdga33%3D9df2edd3ab1d4c49a5c9ac3a0569baab%26ac%3D1%26autoactions%3D1444773433%26uid%3D3816909793%26nid%3D14%2B309&amp;t=1&amp;cn=ZW1haWxfY2hhbmdlX25vdGljZV9uZXh0&amp;sig=2b56e3a59dd6b182afaf3abxcc67d73&amp;iid=9df2edd3ab1d4c49a5c9ac3a0569baab&amp;uid=3816909793&amp;nid=14+309

Using regex101, I built this regular expression, which seemed to work well. However, when I took the generated Python code:

import re
p = re.compile(ur'(https.+)(\\r|\\n)')
test_str = (the full email text)

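then re.search(p, test_str) returns nothing, and so does re.findall().

Why doesn't the generated Python code work, and/or is there a better regex? Note: the text contains several Twitter URLs; I want to match only the one associated with the "Confirm now" button.

Python: 2.7

5 answers: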

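Answer 0 (score: 2)

Before using a regex, or some other more suitable tool, to extract data from an email, you should first process the email properly with an email parser. Python ships with email.parser out of the box: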

raw_content = 'Delivered-To: example@gmail.com...'

import email.parser
email_parser = email.parser.Parser()
email_content = email_parser.parsestr(raw_content)

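def get_all_messages(email_message):
    # Walk the MIME tree iteratively and collect every non-multipart (leaf) part.
    stack = [email_message]
    messages = []
    while stack:
        msg = stack.pop()
        if msg.is_multipart():
            # For a multipart message, get_payload() returns a list of sub-messages.
            stack.extend(msg.get_payload())
        else:
            messages.append(msg)
    return messages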

messages = get_all_messages(email_content)

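The messages variable now holds the individual parts of the email. You can then either use a regex to extract the link from the text/plain part, or use an HTML parser such as BeautifulSoup to extract the link from the text/html part.

Here is sample code that extracts the link from the text/plain part: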
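As a rough illustration of what such a traversal yields, the sketch below parses a tiny hand-written multipart message (a stand-in, not the Twitter email above) and lists the content type of each leaf part. It uses email.message_from_string and Message.walk() from the standard library rather than the explicit stack; the sample message and the names sample and leaf_types are assumptions made for this example.

```python
import email

# A tiny hand-written multipart message, used only for illustration.
sample = (
    'MIME-Version: 1.0\r\n'
    'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    '\r\n'
    '--XYZ\r\n'
    'Content-Type: text/plain; charset=UTF-8\r\n'
    '\r\n'
    'plain body\r\n'
    '--XYZ\r\n'
    'Content-Type: text/html; charset=UTF-8\r\n'
    '\r\n'
    '<p>html body</p>\r\n'
    '--XYZ--\r\n'
)

message = email.message_from_string(sample)

# walk() visits the container first and then each nested part, in document order;
# keep only the leaf (non-multipart) parts.
leaf_types = [part.get_content_type()
              for part in message.walk()
              if not part.is_multipart()]
print(leaf_types)
```

Note that walk() returns the parts in document order, whereas a stack that pops from the end visits them in reverse; for picking a part by content type the order does not matter.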

import re

for msg in messages:
    if msg.get_content_type() == 'text/plain':
        # Decode the body according to Content-Transfer-Encoding, then decode the
        # bytes using the charset from the Content-Type header (UTF-8 if none is given).
        payload = msg.get_payload(decode=True).decode(msg.get_content_charset('utf-8'))
        # \S+ stops at whitespace, so the trailing \r of a CRLF line is not captured.
        links = re.findall(ur'https?://\S+', payload)

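Note the call to .get_payload(decode=True). The decode argument must be given so that the payload is decoded according to the Content-Transfer-Encoding header. It makes no difference for the text/plain part here, which is 7bit, but it affects correctness for the text/html part, whose payload is quoted-printable.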
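To make the effect of decode=True concrete, here is a minimal sketch on a hand-written quoted-printable part. The message text and the example.com URL are invented for illustration; only get_payload and get_content_charset are taken from the answer.

```python
import email

# Hand-written single-part message with a quoted-printable body (illustrative only).
sample = (
    'Content-Type: text/html; charset=UTF-8\r\n'
    'Content-Transfer-Encoding: quoted-printable\r\n'
    '\r\n'
    '<a href=3D"https://example.com/confirm?id=3D1&amp;x=\r\n'
    '=3D2">Confirm now</a>\r\n'
)

part = email.message_from_string(sample)

# Without decode=True the payload is still quoted-printable and contains "=3D".
raw = part.get_payload()
# With decode=True, "=3D" becomes "=" and the soft line break ("=" + CRLF) is removed.
decoded = part.get_payload(decode=True)
text = decoded.decode(part.get_content_charset('utf-8'))

print(raw)
print(text)
```

A regex run over raw would see a URL that is split across two lines and littered with =3D, which is why decoding first matters for the text/html part.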

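Since that part contains only one link, the simple regex above is sufficient.

You can use similar code to obtain the payload of the text/html part before handing it to an HTML parser. Once the HTML is parsed, you can select all <a> tags and keep only those whose link contains confirm_user_email.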
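The <a> filtering step might look roughly like the sketch below. It swaps in the standard library's html.parser.HTMLParser for BeautifulSoup so that it has no third-party dependency, and it is written for Python 3 (in Python 2.7 the module is named HTMLParser). The class name LinkCollector and the shortened HTML snippet are hypothetical stand-ins for the decoded text/html payload.

```python
from html.parser import HTMLParser  # Python 3; the Python 2.7 module is named HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entities in values are already unescaped.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Hypothetical, already-decoded HTML standing in for the text/html payload.
html_text = (
    '<a href="https://twitter.com/i/redirect?url=https%3A%2F%2Ftwitter.com'
    '%2Faccount%2Fconfirm_user_email%2F123&amp;t=1">Confirm now</a>'
    '<a href="https://twitter.com/i/redirect?url=https%3A%2F%2Fsupport.twitter.com%2F'
    '&amp;t=1">Help</a>'
)

collector = LinkCollector()
collector.feed(html_text)

# Keep only the links that point at the confirmation endpoint.
confirm_links = [url for url in collector.links if 'confirm_user_email' in url]
print(confirm_links)
```

A side benefit of going through an HTML parser is that &amp; in the attribute value comes back as a plain &, so the URL can be requested as-is.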

Answer 1 (score: 1)

If you are using a raw string literal, do not escape the \ character a second time. So either remove the leading r:

p = re.compile(u'(https.+)(\\r|\\n)')

Or do not use double backslashes:

p = re.compile(ur'(https.+)(\r|\n)')
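The difference can be checked on a short stand-in string (the text and the example.com URL below are made up). In a raw string, \\r reaches the regex engine as "a literal backslash followed by the letter r", which never occurs in a message that contains real CR LF characters; it does occur in the repr of such a message, which is presumably why the pattern appeared to work on regex101.

```python
import re

# Stand-in for part of the real message: a URL followed by an actual CR LF pair.
text = 'Click here:\r\nhttps://example.com/confirm?x=1\r\nBye\r\n'

# In a raw string, \\r is the regex "literal backslash, then the letter r".
literal_backslash = re.compile(r'(https.+)(\\r|\\n)')
# Here \r and \n reach the regex engine as carriage return and line feed.
control_chars = re.compile(r'(https.+)(\r|\n)')

# No match: the text contains no backslash characters.
print(literal_backslash.search(text))
# Match: the repr of the text does contain backslash-r and backslash-n sequences.
print(literal_backslash.search(repr(text)) is not None)
# Match on the real text; note that "." also matches \r, so group(1) ends with \r.
print(repr(control_chars.search(text).group(1)))
```

Because the greedy .+ swallows the carriage return, the captured URL still needs an rstrip('\r') or a tighter pattern such as \S+ before it is used.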

Hope it helps!

Answer 2 (score: 1)

Try removing the "ur" prefix from the start of the regex. You can also call search directly on the compiled regex object.

Try this:

import re
p = re.compile('(https.+)(\\r|\\n)')
test_str = (the full email text)
desired_string = p.search(test_str)
print desired_string.group(0)
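One thing to watch with the snippet above: search returns None when nothing matches, so calling .group(0) directly would raise AttributeError. A slightly more defensive variant, as a sketch, could look like this; the helper name first_link and the sample strings are hypothetical.

```python
import re

# Same pattern as above: in a normal string, '\\r' becomes the regex escape \r.
pattern = re.compile('(https.+)(\\r|\\n)')


def first_link(text):
    """Return the first matched URL without its trailing CR, or None if there is none."""
    match = pattern.search(text)
    if match is None:
        return None
    # "." also matches \r, so strip the carriage return captured before the \n.
    return match.group(1).rstrip('\r')


print(first_link('no links here\r\n'))
print(first_link('see https://example.com/a\r\nthanks\r\n'))
```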

Answer 3 (score: 1)

I would use a slightly different regex:

import re

with open('out') as f:  # out contains the page content
  content = f.read()

p = re.compile(u'"(https:.*?)"')

for m in re.findall(p, content):
  print m

.*? is a non-greedy match, so it stops at the first double quote that follows.

Answer 4 (score: 1)
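A small side-by-side comparison shows why the ? matters. The HTML fragment below is a made-up one-liner, not taken from the email.

```python
import re

# Hypothetical one-line HTML fragment with two quoted attributes.
fragment = '<a href="https://example.com/confirm" class="button">Confirm now</a>'

# Lazy: stops at the first closing double quote.
lazy = re.findall(r'"(https:.*?)"', fragment)
# Greedy: runs on to the last double quote on the line.
greedy = re.findall(r'"(https:.*)"', fragment)

print(lazy)
print(greedy)
```

The lazy version returns just the URL, while the greedy one also drags in the class attribute.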

result = re.findall(r"(https.*?)(?:\r|\n)", email, re.MULTILINE)
link = result[0]

Live Python demo

http://ideone.com/9R62Ug

Regex explanation

(https.*?)(?:\r|\n)

Match the regex below and capture its match into backreference number 1 «(https.*?)»
   Match the character string “https” literally «https»
   Match any single character that is NOT a line break character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below «(?:\r|\n)»
   Match this alternative «\r»
      Match the carriage return character «\r»
   Or match this alternative «\n»
      Match the line feed character «\n»
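Since the question mentions that the text contains several Twitter URLs, the same pattern can be combined with a simple substring filter to keep only the confirmation link. This is a sketch on an invented two-line body with made-up example.com URLs; the variable names are assumptions as well.

```python
import re

# Stand-in for a decoded body with CRLF line endings and two links.
body = (
    'Help: https://support.example.com/\r\n'
    'Confirm: https://example.com/account/confirm_user_email/123?t=1\r\n'
)

# The lazy .*? stops right before the first \r or \n, so no line break is captured.
candidates = re.findall(r'(https.*?)(?:\r|\n)', body)
# Keep only the links that point at the confirmation endpoint.
confirm = [url for url in candidates if 'confirm_user_email' in url]

print(candidates)
print(confirm)
```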