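htmlparse does not strip <style> </style>

Time: 2015-04-08 08:09:24

Tags: python regex python-2.7 parsing html-email

I have a problem with my HTML parser. I convert emails filled with HTML code into nice clean text, except for the "<style> content </style>" part, which it completely ignores, and I don't know what I'm doing wrong: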

import re

# Remove any HTML code from our raw content
htmlparse = re.sub(r'<.*?>', '', clean) \
    .replace("&nbsp;", '') \
    .replace("&eacute;", 'é') \
    .replace("&egrave;", 'è')
clean_email = htmlparse

What should actually be removed is:

<style>      .MailHeader      {          font: normal 10pt Tahoma, Verdana, Sans-Serif;          vertical-align: top;          padding-bottom: 0px;          padding-top: 0px;          spacing: 0px 0px 0px 0px;      }      .DataHeader      {          font-family: Tahoma;          font-size: 8pt;          color: #666666;          text-decoration: none;          padding-left: 15px;          padding-right: 15px;          border: solid 1px #E0E0E0;          vertical-align: text-top;      }      .Data      {          font: normal 8pt Tahoma,Verdana;          padding-left: 3px;          padding-right: 15px;          border: solid 1px #E0E0E0;          background: #F9F9F9;          font-size: 8pt;          color: #666666;          height: 20px !important;      }      .GridHeader      {          font: normal 8pt Tahoma,Verdana;          padding-left: 6px;          background: #DAEBFF;          height: 20px;      }      .DataRow      {          padding-left: 3px;          padding-right: 15px;          border: solid 1px #E0E0E0;          font-size: 8pt;          color: #003399;      }      .GridRow      {          font: normal 8pt Tahoma, Verdana, Sans-serif;          padding-left: 6px;          background: transparent;          height: 20px !important;          min-height: 1%;      }      .GridAltRow      {          font: normal 8pt Tahoma, Verdana, Sans-serif;          padding-left: 6px;          background: #F9F9F9;          height: 20px !important;          min-height: 1%;      }      .li      {          font: normal 10pt Tahoma, Verdana, Sans-Serif;          vertical-align: top;          padding-left: 5px;      }      .TableHeader      {          font-family: Tahoma,calibri,verdana;          font-size: 8pt;          font-weight: bold;          height: 22px;          color: #003399;          border: solid 1px #E0E0E0;          border-collapse: collapse;          padding-left: 5px;          padding-right: 5px;          background-color: #BBD8FF;      }      .TableSubHeader 
     {          font: normal 8pt Tahoma, Verdana, Sans-Serif;          vertical-align: middle;          padding-left: 3px;          font-weight: bold;          color: #666666;      }      .Separator      {          background-repeat: repeat-x;          background-position: center;          background: #666666;      }      .tableDetail      {          padding: 0 0 0 0;          spacing: 0 0 0 0;          border-collapse: collapse;          width: 750px;          margin-left: 5px;          border: solid 1px #E0E0E0;      }      .style1      {          font: normal 10pt Tahoma, Verdana, Sans-Serif;          vertical-align: top;          padding-bottom: 0px;          padding-top: 0px;          spacing: 0px 0px 0px 0px;          height: 18px;      }  </style>

What it actually does is remove the <style> and </style> tags, but it leaves the entire stylesheet junk in the parsed output:

  

.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } .DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none; padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } .Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; } .GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; } .DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; } .GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } .li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; } .TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px; color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; padding-right: 5px; background-color: #BBD8FF; } .TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif; vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; } .Separator { background-repeat: repeat-x; background-position: center; background: #666666; } .tableDetail { padding: 0 0 0 0; spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; } .style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; } Hello, this is a test mail.
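
This behaviour follows from the pattern itself: `<.*?>` matches each angle-bracketed tag on its own and never the text between two tags, so the CSS rules survive just like any other text. A minimal sketch that reproduces it, assuming a made-up and much shorter sample string:

```python
import re

# Hypothetical sample input, much shorter than the real email.
sample = "<style>.Data { color: #666666; }</style><p>Hello</p>"

# Each tag is matched and removed individually; the text between tags is kept.
tags_only = re.sub(r'<.*?>', '', sample)
print(tags_only)
```

The CSS rule is still present in `tags_only`, right in front of the paragraph text.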

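Can anyone help me?

Thanks in advance. Regards

2 answers:

Answer 0 (score: 1)

First remove the style block itself, content included, and then do whatever else you want in a second pass.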

import re

some = """
<style>.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif;
vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } 
.DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none;
padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } 
.Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0;
background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; }
.GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; }
.DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; }
.GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; 
height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif;
padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } 
.li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; }
.TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px;
color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; 
padding-right: 5px; background-color: #BBD8FF; } 
.TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif;
vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; }
.Separator { background-repeat: repeat-x; background-position: center; background: #666666; }
.tableDetail { padding: 0 0 0 0; 
spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; }
.style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom:
0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; }
</style>
<h1>Hello, this is a test mail.</h1>
"""

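# re.DOTALL lets "." match newlines, so the block is removed across lines.
# The non-greedy ".*?" stops at the first </style>, which keeps any text
# between two separate <style> blocks.
some1 = re.sub(r'<style>.*?</style>', '', some, flags=re.DOTALL)

print(some1)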

Result:

I have no name!@sla-334:~/stack_o$ python stack_o_html.py 


<h1>Hello, this is a test mail.</h1>

Now do whatever you want with your HTML.
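
One way the two passes could be combined, as a minimal sketch rather than a definitive implementation. The helper name `html_to_text` and the sample string are made up for illustration. The first pattern is an assumption that goes slightly beyond the answer: it also accepts attributes such as `type='text/css'` and treats `<script>` the same way.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape


def html_to_text(raw):
    """Hypothetical helper: drop style/script blocks, then tags, then decode entities."""
    # Pass 1: remove whole <style>/<script> elements, including their content.
    no_blocks = re.sub(r'<(style|script)\b[^>]*>.*?</\1\s*>', '', raw,
                       flags=re.DOTALL | re.IGNORECASE)
    # Pass 2: remove the remaining tags but keep the text between them.
    no_tags = re.sub(r'<[^>]+>', '', no_blocks)
    # Pass 3: turn entities such as &eacute; into the characters they stand for.
    return unescape(no_tags).strip()


sample = "<style type='text/css'>.Data { color: red; }</style>\n<h1>Caf&eacute; test</h1>"
print(html_to_text(sample))
```

Decoding entities in one call avoids a long chain of `.replace()` calls, although regular expressions remain fragile on malformed HTML.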

Answer 1 (score: 1)

I solved this by running a second pass over the already-parsed text:

cleaner = re.sub(r'{.*}', '', clean_email) \
    .replace(".MailHeader", '')

I will try your solution as well.
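
As an alternative to stacking regular expressions, the standard library's HTML parser can collect the text while skipping everything inside `<style>` and `<script>`. This is only a sketch under assumptions: `TextExtractor` and `extract_text` are hypothetical names, and on Python 2.7 entity references would additionally need a `handle_entityref` method, which is omitted here.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <style> and <script> content."""

    SKIPPED = ("style", "script")

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(raw):
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts).strip()


print(extract_text("<style>.Data { color: red; }</style><h1>Hello, this is a test mail.</h1>"))
```

Unlike the `{.*}` pass, this does not depend on which selectors appear in the stylesheet, so no per-class `.replace()` calls are needed.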