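How to print a line of text from an HTML file

Time: 2016-12-14 15:49:14

Tags: python html regex

My goal is to use RegEx to scan an email for the word "trade" and then print the entire line on which it is found.

I have successfully used RegEx to capture other data in this HTML document (e.g. species, weight, price), and I can successfully identify the word "trade", but I have not managed to print the entire line it is on. I did try using BeautifulSoup for this, but ran into a lot of difficulty.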

Here is the document I believe is in HTML format (correct me if I'm wrong and it's not HTML):

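Ideally, I would like to capture and print the line on which "trade" is found. Below is the code I used to try to identify "trade" and print the line it is on: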

with open(file_path, 'r') as f:
        email = f.read()
        pattern = re.search(r'\btrade\b',email).group(0)
        match = re.search(r'\btrade\b', email)
        if match:
            for line in email:
                print("TRADE STUFF:",line)

Note that I have tried various approaches, such as print("TRADE STUFF:", line.splitlines()) and print("TRADE STUF:", line.stripped_strings), but none of them worked.

Thanks for your help.

HTML code:

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>FW: NEFS 5 Available Fish</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Subject: </b>FW: NEFS 5 Available Fish</td></tr><tr><td><b>From: </b>Claire Fitz-Gerald <claire@capecodfishermen.org></td></tr><tr><td><b>Date: </b>9/5/2014 9:52 AM</td></tr></table><br>
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; "><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
    {font-family:Wingdings;
    panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:Tahoma;
    panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
    {font-family:"Franklin Gothic Book";
    panose-1:2 11 5 3 2 1 2 2 2 4;}
@font-face
    {font-family:"Franklin Gothic Demi";
    panose-1:2 11 7 3 2 1 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle18
    {mso-style-type:personal-reply;
    font-family:"Calibri","sans-serif";
    color:#1F497D;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
/* List Definitions */
@list l0
    {mso-list-id:1512259006;
    mso-list-template-ids:-893643712;}
@list l0:level1
    {mso-level-number-format:bullet;
    mso-level-text:\F0B7;
    mso-level-tab-stop:.5in;
    mso-level-number-position:left;
    text-indent:-.25in;
    mso-ansi-font-size:10.0pt;
    font-family:Symbol;}
ol
    {margin-bottom:0in;}
ul
    {margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Apologies for the delay in distributing this listing.&nbsp; It got lost in my inbox.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Please see the below quota listings.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><p class=MsoNormal><span style='font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'>Claire Fitz-Gerald<o:p></o:p></span></p><p class=MsoNormal><i><span style='font-size:10.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></i></p><p class=MsoNormal><b><span style='font-size:11.0pt;font-family:"Franklin Gothic Demi","sans-serif";color:#002776'>Cape Cod Commercial Fishermen's Alliance<o:p></o:p></span></b></p><p class=MsoNormal><b><span style='font-size:11.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#DE3500'>~ Small Boats.&nbsp; Big Ideas. 
~</span></b><b><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#DE3500'><o:p></o:p></span></b></p></div><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> NEFS V [mailto:nefsector5@gmail.com] <br><b>Sent:</b> Monday, September 01, 2014 8:46 PM<br><b>To:</b> mike walsh - 6; NEFS 11 &amp; 12 - Josh Wiersma; NEFS 13 John Haran; NEFS 2 - Dave Leveille; NEFS 3 - Rob Banks; NEFS 6 &amp; 10 Jim Reardon; NEFS 7 &amp; 8 - Linda MaCann; NEFS 9 - Stephanie Rafael-DeMello; paula lynch - 10; Claire Fitz-Gerald; Sector - MCCS; Sector - NCCS; Sector - Sustainable Harvest; tory bramante- 6<br><b>Subject:</b> NEFS 5 Available Fish<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><div><p class=MsoNormal>All,<br>NEFS 5 has the following fish available for lease/trade:<o:p></o:p></p></div><div><ul type=disc><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB EAST cod: 954 lbs @ $0.83</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB EAST cod: 1,046 lbs to trade for 1,830 lbs GB WEST cod</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB blackback: 30,000 lbs @ $0.07</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GOM blackback: 800 lbs @ 
$0.03</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>white hake: 6,322 lbs @ $0.13</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>pollock: 22,000 lbs @ $0.015</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>redfish: 14,000 lbs @ $0.015</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB yt: 1,873 lbs @ $1.13</span></strong><o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'><strong><span style='font-size:13.5pt'>GB yt: 5,127 lbs to trade for 10,254 lbs SNE yt</span></strong><o:p></o:p></li></ul><div><p class=MsoNormal>&nbsp;<o:p></o:p></p></div><div><p class=MsoNormal>-- <o:p></o:p></p></div><div><p class=MsoNormal>&nbsp;<o:p></o:p></p></div></div><div><p class=MsoNormal>Daniel Salerno, NEFS 5<o:p></o:p></p></div><div><p class=MsoNormal>C/O NESTCo.<o:p></o:p></p></div><div><p class=MsoNormal>55 State Street<o:p></o:p></p></div><div><p class=MsoNormal>Narragansett, RI 02882<o:p></o:p></p></div><div><p class=MsoNormal>401-932-0070<o:p></o:p></p></div><div><p class=MsoNormal>401-633-6539 (fax)<o:p></o:p></p></div><div><p class=MsoNormal><a href="mailto:nefsector5@gmail.com" target="_blank">nefsector5@gmail.com</a><o:p></o:p></p></div><div class=MsoNormal align=center style='text-align:center'></body></html>
</body>
</html>

2 Answers:

Answer 0 (score: 1)

I would do it like this:

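# file_path is the path to the saved email, as in the question.
with open(file_path, 'r') as f:
    for line in f:  # iterate over the file one physical line at a time
        if "trade" in line.lower():
            # Turn every '>' into '<' and split, so markup and text become separate pieces.
            pieces = line.replace('>', '<').split('<')
            for piece in pieces:
                if "trade" in piece.lower():
                    print("TRADE STUFF: ", piece.strip())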
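
Because most of this email sits on a few very long physical lines, splitting on tag boundaries is what isolates the readable text. As an alternative sketch of the same idea, the standard-library html.parser module can hand back the text between tags directly. The class name TradeTextCollector, the helper find_trade_text, and the shortened sample string are assumptions made for illustration; they are not part of the question or of this answer.

```python
import re
from html.parser import HTMLParser

# Whole-word, case-insensitive match for "trade".
TRADE = re.compile(r'\btrade\b', re.IGNORECASE)


class TradeTextCollector(HTMLParser):
    """Collect the text between tags whenever it contains the word 'trade'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_data(self, data):
        # Collapse line breaks and runs of spaces inside the text node.
        text = " ".join(data.split())
        if TRADE.search(text):
            self.matches.append(text)


def find_trade_text(html):
    """Return every text node in html that mentions 'trade' as a whole word."""
    collector = TradeTextCollector()
    collector.feed(html)
    collector.close()
    return collector.matches


# Shortened sample modelled on the list items in the email.
sample = (
    "<ul><li><span>GB EAST cod: 954 lbs @ $0.83</span></li>"
    "<li><span>GB EAST cod: 1,046 lbs to trade for 1,830 lbs GB WEST cod</span></li></ul>"
)

for text in find_trade_text(sample):
    print("TRADE STUFF:", text)
```

With the real file, the string passed to find_trade_text would be the result of f.read(). Note that the sentence ending in "lease/trade:" would also be reported, since "trade" appears there as a whole word.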

Answer 1 (score: 0)

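Swap the 'for' loop and the 'if' statement, and run the test against each line rather than against the whole email, like so:

# email is the string returned by f.read(); re must be imported.
for line in email.splitlines():
    if re.search(r'\btrade\b', line):
        print("TRADE STUFF: ", line)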
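
A small sketch of why looping directly over the string returned by f.read() prints one character at a time, and how str.splitlines() changes that. The sample string is made up for illustration and is much shorter than the real email.

```python
import re

# Hypothetical stand-in for the text returned by f.read().
email = "All,\nNEFS 5 has fish available for lease/trade:\nGB yt: 1,873 lbs @ $1.13\n"

# Iterating over a str yields single characters, not lines.
first_items = [item for item in email][:4]

# splitlines() yields whole lines; the regex is then tested against each one.
trade_lines = [line for line in email.splitlines()
               if re.search(r'\btrade\b', line)]

print(first_items)
print(trade_lines)
```

In the actual HTML file a "line" in this sense can be thousands of characters long, because Word exports most of the body without line breaks, so a per-line match there prints far more than the sentence containing "trade".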