Python: Check whether a value appears in an HTML table

Time: 2016-12-06 15:48:15

Tags: python html beautifulsoup html-parsing

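I am using BeautifulSoup to extract specific data from an HTML table, and now I need to check whether that data appears elsewhere in the table and, if so, capture those occurrences as well.

More specifically, the last column of my table contains only dates. The most recent date is at the top of the column, and this date sometimes reappears in lower rows. I am trying to write an `if` check that tests whether that date appears elsewhere in the table and, if it does, captures it and displays the corresponding data in its row.

Here is my relevant code: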

`rows = table.findAll("tr")[0:20]
more_data = {'last_column': []}
for row in rows[1:2]:
    cols = row.findAll("td")[0:20]
more_data['last_column'].extend(cols[4].stripped_strings)
var1 = more_data['last_column']             # stores date 5/20 in variable
str1 = ''.join(var1)                        # turns 5/20 from list to string`


`data_data = {'entire_last_column': []}
for row in rows[0:20]:
    cols = row.findAll("td")[0:20]
    data_data['entire_last_column'].extend(cols[4].stripped_strings)
var2 = data_data['entire_last_column']     # stores entire last column of dates in variable`


`#var2.count(str1)

if any(var1 in s for s in var2):
    print(s)`

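The last few lines of code are my attempt to search for and capture further occurrences of '5/20'. But even those fail.

I need to verify whether '5/20' appears again in the table. If it does, I need all the data in that row; if it does not, the program should simply end.

Here is the HTML code: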

`<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>FW: NEFS 2 Available Quota 5/21</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Subject: </b>FW: NEFS 2 Available Quota 5/21</td></tr><tr><td><b>From: </b>Claire Fitz-Gerald <claire@capecodfishermen.org></td></tr><tr><td><b>Date: </b>5/21/2014 10:08 AM</td></tr></table><br>
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; "><meta name=Generator content="Microsoft Word 12 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:Tahoma;
    panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
    {font-family:"Franklin Gothic Book";
    panose-1:2 11 5 3 2 1 2 2 2 4;}
@font-face
    {font-family:"Franklin Gothic Demi";
    panose-1:2 11 7 3 2 1 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle17
    {mso-style-type:personal;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
span.title1
    {mso-style-name:title1;
    font-family:"Arial","sans-serif";
    color:#1F487E;
    font-weight:normal;}
span.EmailStyle19
    {mso-style-type:personal-reply;
    font-family:"Calibri","sans-serif";
    color:#1F497D;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='color:#1F497D'>Please see the below quota listings.<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'>Claire Fitz-Gerald<o:p></o:p></span></p><p class=MsoNormal><i><span style='font-size:10.0pt;font-family:"Franklin Gothic Book","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></i></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Demi","sans-serif";color:#002776'>Cape Cod Commercial Fishermen's Alliance<o:p></o:p></span></b></p><p class=MsoNormal><b><span style='font-family:"Franklin Gothic Book","sans-serif";color:#DE3500'>~ Small Boats.&nbsp; Big Ideas. 
~</span></b><b><span style='color:#DE3500'><o:p></o:p></span></b></p></div><p class=MsoNormal><span style='color:#1F497D'><o:p>&nbsp;</o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> David Leveille [mailto:nefs02@gmail.com] <br><b>Sent:</b> Wednesday, May 21, 2014 8:50 AM<br><b>To:</b> David Leveille<br><b>Subject:</b> NEFS 2 Available Quota 5/21<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Arial","sans-serif";color:#1F487E'>AVAILABLE QUOTA FY 2014</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p><table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width="71%" style='width:71.28%'><tr><td width=220 style='width:164.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><b><span style='font-size:9.0pt;font-family:"Arial","sans-serif";color:black'>ID <o:p></o:p></span></b></p></td><td width=161 style='width:120.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Available Quota <o:p></o:p></span></b></p></td><td width=189 style='width:141.75pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Live Weight Pounds <o:p></o:p></span></b></p></td><td width=126 style='width:94.55pt;border:none;border-bottom:solid windowtext 
1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Price <o:p></o:p></span></b></p></td><td width=168 style='width:125.95pt;border:none;border-bottom:solid windowtext 1.0pt;background:#8BCDFF;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='mso-line-height-alt:15.0pt'><b><span style='font-size:18.0pt;font-family:"Arial","sans-serif";color:black'>Date Posted <o:p></o:p></span></b></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1724<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>GOM BB<br>GREYSOLE<br>DABS<br>GOM YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>2328<br>445<br>3007<br>850<br>3101<br>1995<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span 
style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$9,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1578<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB BB<br>GB YT<br>SNE BB<br>SNE YT<br>GOM BB<br>Whake<br>POLL<br>RED<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>538<br>5894<br>1755<br>243<br>490<br>153<br>3965<br>2727<br>9227<br>15060<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.00<br>$0.40<br>$0.20<br>$1.00<br>$0.45<br>$0.50<br>$0.15<br>$0.20<br>$0.01<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr><td 
width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>310<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>DABS<br>WHAKE<br>POLL<br>RED<br>SNE BB<br>GOM BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>825<br>9033<br>1241<br>3120<br>65234<br>76610<br>1688<br>1195<br>2121<br>7285<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<o:p></o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$15,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/20<o:p></o:p></span></p></td></tr><tr style='height:23.25pt'><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal 
style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>347<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>SNE BB<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>8,000<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$0.50<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt;height:23.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/7<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878A<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GOM COD<br>GOM HADD<br>SNE BB<br>GOM BB<br>GB BB<br>GREYSOLE<br>GOM YT<br>SNE YT<br>POLL<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 
1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>6188<br>635<br>3916<br>7873<br>6762<br>3358<br>9776<br>271<br>186550<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>$1.95<br>$1.35<br>$0.50<br>$0.50<br>$0.20<br>$1.40<br>$1.20<br>$0.50<br>$0.01<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr><tr><td width=220 style='width:164.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1878B<o:p></o:p></span></p></td><td width=161 style='width:120.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>GBE COD<br>GBW COD<br>GB YT<o:p></o:p></span></p></td><td width=189 style='width:141.75pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>1113<br>12186<br>850<o:p></o:p></span></p></td><td width=126 style='width:94.55pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span 
style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>Package<br>$10,000<o:p></o:p></span></p></td><td width=168 style='width:125.95pt;border:solid windowtext 1.0pt;background:white;padding:2.25pt 2.25pt 2.25pt 2.25pt'><p class=MsoNormal style='line-height:15.0pt'><span style='font-size:13.5pt;font-family:"Arial","sans-serif";color:black'>5/12<o:p></o:p></span></p></td></tr></table><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><p class=MsoNormal>David Leveille<o:p></o:p></p><p class=MsoNormal>II Northeast Fishery Sector Inc.<o:p></o:p></p><p class=MsoNormal>10 Witham Street<o:p></o:p></p><p class=MsoNormal>Gloucester, MA. 01930<o:p></o:p></p><p class=MsoNormal>Cell 978 375 3509<o:p></o:p></p><p class=MsoNormal>Fax 978 281 1555<o:p></o:p></p><p class=MsoNormal>Web <a href="http://nefs2.com/">http://nefs2.com/</a><o:p></o:p></p><p class=MsoNormal><o:p>&nbsp;</o:p></p><div class=MsoNormal align=center style='text-align:center'><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'></body></html>
</body>
</html>
`


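1 answer:

Answer 0 (score: 0)

I would find all the table rows and use the max() function with a custom key function to determine the "latest" record. The custom function finds the date in the last cell of each row and converts it to a datetime.date so that the dates can be compared with each other: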

import re
from datetime import date

from bs4 import BeautifulSoup


data = """your HTML here"""    

soup = BeautifulSoup(data, "html.parser")

# get the table header label
label = soup.find("span", text=re.compile(r"^AVAILABLE QUOTA FY"))

# extract the year
year = int(re.search(r"\d{4}", label.get_text()).group(0))

table = label.find_next("table")
rows = table.find_all("tr")[1:]  # skipping header row


def get_date(row):
    last_column_cell = row.find_all("td")[-1]

    # extract month and day
    month, day = last_column_cell.get_text().split("/")
    return date(year=year, month=int(month), day=int(day))


# filter the row with the latest date
row = max(rows, key=get_date)

# print the value of every cell in the row
print([td.get_text(strip=True) for td in row.find_all("td")])

The code lacks error handling and could be improved further, but it works for the provided sample data and prints:

['1724', 'GOM CODGOM HADDGOM BBGREYSOLEDABSGOM YT', '2328445300785031011995', 'Package$9,000', '5/20']
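The missing error handling mentioned above mostly concerns the date cell: `split("/")` raises if a cell holds anything other than a month/day pair. The following is only a minimal sketch of a more defensive parser, under the assumption that date cells always look like `M/D`; the helper name `parse_month_day` and the pattern `MONTH_DAY` are hypothetical and not part of the answer's code.

```python
import re
from datetime import date

# Hypothetical helper (not in the answer's code): accept only "M/D" text.
MONTH_DAY = re.compile(r"^\s*(\d{1,2})\s*/\s*(\d{1,2})\s*$")


def parse_month_day(text, year):
    """Return a date for "M/D" text, or None if the text is not a valid date."""
    match = MONTH_DAY.match(text)
    if match is None:
        return None
    month, day = int(match.group(1)), int(match.group(2))
    try:
        return date(year, month, day)
    except ValueError:  # digits present but not a real calendar date
        return None


print(parse_month_day("5/20", 2014))         # 2014-05-20
print(parse_month_day("Date Posted", 2014))  # None
```

If something like this were used inside `get_date`, rows for which it returns `None` would have to be filtered out before calling `max()`, and since `max()` raises `ValueError` on an empty sequence, that case would need its own check as well.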

Note that max() only finds a single row with the latest date. If you want to go further and find all rows that share this date:

# get the date string from the last cell of the latest row
latest_date_string = row.find_all("td")[-1].get_text().strip()

# print the cell values of every row that contains this date string
print([[td.get_text(strip=True) for td in date_string.find_parent("tr").find_all("td")]
       for date_string in table.find_all(text=latest_date_string)])
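
As a side note on why the attempt in the question fails regardless of BeautifulSoup: `var1` is a list, so `var1 in s` raises `TypeError` when `s` is a string, and the loop variable `s` of the generator expression passed to `any()` is not visible outside it, so `print(s)` cannot work either. Below is a minimal sketch of the intended check using plain lists. It assumes the cell texts have already been extracted per row; the name `rows_text` is hypothetical, and the sample values are the ID and date columns from the question's table.

```python
# Each inner list stands for one table row, reduced to [ID, date posted].
# The values come from the question's table; the layout is an assumption.
rows_text = [
    ["1724", "5/20"],
    ["1578", "5/20"],
    ["310", "5/20"],
    ["347", "5/7"],
    ["1878A", "5/12"],
    ["1878B", "5/12"],
]

# Date in the first data row, which the question says is the most recent one.
target = rows_text[0][-1]

# Compare for equality: a substring test would also match text like "15/20".
matching_rows = [row for row in rows_text if row[-1] == target]

for row in matching_rows:
    print(row)
```

Keeping whole rows in the list, rather than only the date column, is what makes it possible to print the rest of the row once a match is found.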