I have a list of BeautifulSoup HTML elements, which I obtained with the following call:
tables = HTML.findAll("table", {"class": "datadisplaytable"})
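This simply returns every table in the HTML document that matches the query. That works fine, except that it returns duplicate tables (as the output below shows).
I tried removing the duplicates like this: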
tables = list(set(HTML.findAll("table", {"class": "datadisplaytable"})))
That removes the duplicates, but it does not preserve the order, which I need.
So I tried this:
holder = []
for item in tables:
    if item not in holder:
        holder.append(item)
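However, the duplicates are still there. Do the approaches above not work on BeautifulSoup elements? If not, how can I remove duplicate BeautifulSoup elements while preserving their order?
Edit: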
from collections import OrderedDict
tables = OrderedDict.fromkeys(HTML.findAll("table", {"class": "datadisplaytable"})).keys()
Then, when I print it like this, the output is duplicate-free:
for item in tables:
    print "\n\n\n"
    print item
However, when I print it the following way instead, the duplicates are back. Am I going crazy?
i = 0
while i < len(tables) - 1:
    print "\n\nitem[i]: \n", tables[i]
    print "\n\nitem[i+1]: \n", tables[i+1]
    i += 1
Any ideas?
item[i]:
<table class="datadisplaytable" summary="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tbody><tr>
<th class="ddheader" scope="col">Type</th>
<th class="ddheader" scope="col">Time</th>
<th class="ddheader" scope="col">Days</th>
<th class="ddheader" scope="col">Where</th>
<th class="ddheader" scope="col">Date Range</th>
<th class="ddheader" scope="col">Schedule Type</th>
<th class="ddheader" scope="col">Instructors</th>
</tr>
<tr>
<td class="dddefault">Class</td>
<td class="dddefault">2:00 pm - 3:15 pm</td>
<td class="dddefault">MWF</td>
<td class="dddefault">Manchester Hall 241</td>
<td class="dddefault">Jan 13, 2015 - May 07, 2015</td>
<td class="dddefault">Lecture</td>
<td class="dddefault">William H. Turkett (<abbr title="Primary">P</abbr>)<a href="mailto:turketwh@wfu.edu" target="William H. Turkett"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a></td>
</tr>
</tbody></table>
item[i+1]:
<table class="datadisplaytable" summary="This layout table is used to present the schedule course detail"><caption class="captiontext">Linear Algebra I - MTH 121 - C</caption>
<tbody><tr>
<th class="ddlabel" colspan="2" scope="row">Associated Term:</th>
<td class="dddefault">Spring 2015</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row"><acronym title="Course Reference Number">CRN</acronym>:</th>
<td class="dddefault">19765</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Status:</th>
<td class="dddefault">**Web Registered** on Nov 05, 2014</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Assigned Instructor:</th>
<td class="dddefault">
Jason D. Gaddis<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a>
</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Grade Mode:</th>
<td class="dddefault">Standard Letter</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Credits:</th>
<td class="dddefault"> 4.000</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Level:</th>
<td class="dddefault">Undergraduate</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Campus:</th>
<td class="dddefault">Reynolda Campus (UG)</td>
</tr>
</tbody></table>
item[i]:
<table class="datadisplaytable" summary="This layout table is used to present the schedule course detail"><caption class="captiontext">Linear Algebra I - MTH 121 - C</caption>
<tbody><tr>
<th class="ddlabel" colspan="2" scope="row">Associated Term:</th>
<td class="dddefault">Spring 2015</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row"><acronym title="Course Reference Number">CRN</acronym>:</th>
<td class="dddefault">19765</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Status:</th>
<td class="dddefault">**Web Registered** on Nov 05, 2014</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Assigned Instructor:</th>
<td class="dddefault">
Jason D. Gaddis<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a>
</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Grade Mode:</th>
<td class="dddefault">Standard Letter</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Credits:</th>
<td class="dddefault"> 4.000</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Level:</th>
<td class="dddefault">Undergraduate</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Campus:</th>
<td class="dddefault">Reynolda Campus (UG)</td>
</tr>
</tbody></table>
item[i+1]:
<table class="datadisplaytable" summary="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tbody><tr>
<th class="ddheader" scope="col">Type</th>
<th class="ddheader" scope="col">Time</th>
<th class="ddheader" scope="col">Days</th>
<th class="ddheader" scope="col">Where</th>
<th class="ddheader" scope="col">Date Range</th>
<th class="ddheader" scope="col">Schedule Type</th>
<th class="ddheader" scope="col">Instructors</th>
</tr>
<tr>
<td class="dddefault">Class</td>
<td class="dddefault">12:30 pm - 1:45 pm</td>
<td class="dddefault">MWF</td>
<td class="dddefault">Carswell Hall 101</td>
<td class="dddefault">Jan 13, 2015 - May 07, 2015</td>
<td class="dddefault">Lecture</td>
<td class="dddefault">Jason Dale Gaddis (<abbr title="Primary">P</abbr>)<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a></td>
</tr>
</tbody></table>
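A note on reading this output: the while loop prints tables[i] and tables[i+1] and then advances i by one, so every table except the first and the last is printed twice, once as item[i+1] and again as item[i] on the next pass. That appears consistent with the output above, where the "Linear Algebra I" table shows up as item[i+1] and then immediately as item[i]. A minimal sketch of that behaviour, using placeholder strings instead of real tables (adjacent_pairs is a hypothetical helper name, not part of BeautifulSoup):

```python
def adjacent_pairs(items):
    """Yield (items[i], items[i + 1]) exactly as the while loop walks the list."""
    i = 0
    while i < len(items) - 1:
        yield items[i], items[i + 1]
        i += 1


# Placeholder values standing in for three distinct tables (assumption for the demo).
tables = ["A", "B", "C"]

printed = []
for current, following in adjacent_pairs(tables):
    printed.append(current)    # what the loop prints as item[i]
    printed.append(following)  # what the loop prints as item[i+1]

# The middle element is emitted twice even though the list has no duplicates.
print(printed)
```

If that is what is happening, the list itself may already be duplicate-free and only the printing repeats entries.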
Answer 0 (score: 0)
Instead, I would rely on something that is unique to each table, for example the summary attribute:
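# soup is the parsed BeautifulSoup document (named HTML in the question)
summaries = set()  # summary values seen so far
tables = []        # unique tables, in document order
for table in soup.find_all("table", {"class": "datadisplaytable"}):
    summary = table['summary']  # raises KeyError if a table has no summary attribute
    if summary not in summaries:
        summaries.add(summary)
        tables.append(table)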
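One caveat visible in the question's sample output: both "Scheduled Meeting Times" tables carry the same summary text but have different rows, so keying on summary alone would keep only the first of them. The sketch below generalises the same seen-set idea with a pluggable key. unique_by is a hypothetical helper name, and the sample data are stand-in dicts rather than real Tag objects. Using the rendered markup as the key (key=str) would be one way to treat tables as duplicates only when their whole content matches; that choice is an assumption, not something the answer above proposes.

```python
def unique_by(items, key):
    """Return items without duplicates, keeping the first occurrence of each.

    Two items count as duplicates when key(item) gives equal values.
    The original order of the surviving items is preserved.
    """
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


# Stand-in data: plain dicts instead of BeautifulSoup Tag objects (assumption).
rows = [
    {"summary": "meeting times", "time": "2:00 pm"},
    {"summary": "course detail", "time": ""},
    {"summary": "course detail", "time": ""},
    {"summary": "meeting times", "time": "12:30 pm"},
]

# Keyed on summary only: the second "meeting times" entry is dropped.
by_summary = unique_by(rows, key=lambda r: r["summary"])

# Keyed on the full content: only the exact repeat is dropped.
by_content = unique_by(rows, key=lambda r: (r["summary"], r["time"]))
```

With the tables from the question, the call might look like unique_by(tables, key=str).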