Why can't I remove duplicates from this list of BeautifulSoup HTML elements in Python?

Time: 2015-03-12 23:32:02

Tags: python duplicates beautifulsoup

I currently have a list of BeautifulSoup HTML elements, which I get from the following method call:

tables = HTML.findAll("table", {"class": "datadisplaytable"})

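This simply returns every table in the HTML document that matches the query. That works fine, except that it returns duplicate tables (as shown in the output below).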

I tried removing the duplicates like this:

tables = list(set(HTML.findAll("table", {"class": "datadisplaytable"})))

That removes the duplicates, but it does not preserve the order, which I need.

So I tried this:

holder = []

for item in tables:
    if item not in holder:
        holder.append(item)

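However, the duplicates are still there. Do the approaches above not work on BeautifulSoup HTML? If not, how can I remove duplicate BeautifulSoup HTML elements while preserving their order?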

Edit

from collections import OrderedDict

tables = OrderedDict.fromkeys(HTML.findAll("table", {"class": "datadisplaytable"})).keys()

Printed this way, the result is free of duplicates:

for item in tables:
    print "\n\n\n"
    print item

But when I print it the following way, the duplicates come back. Am I going crazy?

i = 0
while (i < len(tables)-1):
    print "\n\nitem[i]:   \n", tables[i]
    print "\n\nitem[i+1]:  \n", tables[i+1]

    i += 1

Any ideas?

item[i]:   
<table class="datadisplaytable" summary="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tbody><tr>
<th class="ddheader" scope="col">Type</th>
<th class="ddheader" scope="col">Time</th>
<th class="ddheader" scope="col">Days</th>
<th class="ddheader" scope="col">Where</th>
<th class="ddheader" scope="col">Date Range</th>
<th class="ddheader" scope="col">Schedule Type</th>
<th class="ddheader" scope="col">Instructors</th>
</tr>
<tr>
<td class="dddefault">Class</td>
<td class="dddefault">2:00 pm - 3:15 pm</td>
<td class="dddefault">MWF</td>
<td class="dddefault">Manchester Hall 241</td>
<td class="dddefault">Jan 13, 2015 - May 07, 2015</td>
<td class="dddefault">Lecture</td>
<td class="dddefault">William H.  Turkett (<abbr title="Primary">P</abbr>)<a href="mailto:turketwh@wfu.edu" target="William H. Turkett"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a></td>
</tr>
</tbody></table>


item[i+1]:  
<table class="datadisplaytable" summary="This layout table is used to present the schedule course detail"><caption class="captiontext">Linear Algebra I - MTH 121 - C</caption>
<tbody><tr>
<th class="ddlabel" colspan="2" scope="row">Associated Term:</th>
<td class="dddefault">Spring 2015</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row"><acronym title="Course Reference Number">CRN</acronym>:</th>
<td class="dddefault">19765</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Status:</th>
<td class="dddefault">**Web Registered** on Nov 05, 2014</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Assigned Instructor:</th>
<td class="dddefault">
Jason D. Gaddis<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a>
</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Grade Mode:</th>
<td class="dddefault">Standard Letter</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Credits:</th>
<td class="dddefault">    4.000</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Level:</th>
<td class="dddefault">Undergraduate</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Campus:</th>
<td class="dddefault">Reynolda Campus  (UG)</td>
</tr>
</tbody></table>


item[i]:   
<table class="datadisplaytable" summary="This layout table is used to present the schedule course detail"><caption class="captiontext">Linear Algebra I - MTH 121 - C</caption>
<tbody><tr>
<th class="ddlabel" colspan="2" scope="row">Associated Term:</th>
<td class="dddefault">Spring 2015</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row"><acronym title="Course Reference Number">CRN</acronym>:</th>
<td class="dddefault">19765</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Status:</th>
<td class="dddefault">**Web Registered** on Nov 05, 2014</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Assigned Instructor:</th>
<td class="dddefault">
Jason D. Gaddis<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a>
</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Grade Mode:</th>
<td class="dddefault">Standard Letter</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Credits:</th>
<td class="dddefault">    4.000</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Level:</th>
<td class="dddefault">Undergraduate</td>
</tr>
<tr>
<th class="ddlabel" colspan="2" scope="row">Campus:</th>
<td class="dddefault">Reynolda Campus  (UG)</td>
</tr>
</tbody></table>


item[i+1]:  
<table class="datadisplaytable" summary="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tbody><tr>
<th class="ddheader" scope="col">Type</th>
<th class="ddheader" scope="col">Time</th>
<th class="ddheader" scope="col">Days</th>
<th class="ddheader" scope="col">Where</th>
<th class="ddheader" scope="col">Date Range</th>
<th class="ddheader" scope="col">Schedule Type</th>
<th class="ddheader" scope="col">Instructors</th>
</tr>
<tr>
<td class="dddefault">Class</td>
<td class="dddefault">12:30 pm - 1:45 pm</td>
<td class="dddefault">MWF</td>
<td class="dddefault">Carswell Hall 101</td>
<td class="dddefault">Jan 13, 2015 - May 07, 2015</td>
<td class="dddefault">Lecture</td>
<td class="dddefault">Jason Dale  Gaddis (<abbr title="Primary">P</abbr>)<a href="mailto:gaddisjd@wfu.edu" target="Jason D. Gaddis"><img align="middle" alt="E-mail" border="0" class="headerImg" height="28" hspace="0" name="web_email" src="/wtlgifs/web_email.gif" title="E-mail" vspace="0" width="28"/></a></td>
</tr>
</tbody></table>
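
One possible reading of the output above, offered as a sketch rather than a diagnosis: the `while` loop prints both `tables[i]` and `tables[i+1]` on every pass, so each element other than the first and the last is shown twice, once as `item[i+1]` and again as `item[i]` on the next pass. The list itself may already be free of duplicates. The example below is written for Python 3 and uses a plain list of strings as a stand-in for the tables; the names `items` and `adjacent_pairs` are made up for illustration.

```python
def adjacent_pairs(items):
    """Return (items[i], items[i + 1]) for every i, mirroring the while loop above."""
    return [(items[i], items[i + 1]) for i in range(len(items) - 1)]


# Stand-in for three distinct tables; the list itself holds no duplicates.
items = ["meeting times A", "course detail", "meeting times B"]

for current, following in adjacent_pairs(items):
    print("item[i]:  ", current)
    print("item[i+1]:", following)

# "course detail" is printed twice even though it occurs only once in `items`.
```

Iterating with a plain `for item in tables:` loop, as in the earlier snippet, shows each element exactly once.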

1 answer:

Answer 0 (score: 0)

Instead, I would rely on something that is unique to each table, for example the summary attribute:

summaries = set()
tables = []

for table in soup.find_all("table", {"class": "datadisplaytable"}):
    summary = table['summary']
    if summary not in summaries:
        summaries.add(summary)
        tables.append(table)
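
If no single attribute is guaranteed to be unique, the same seen-set idea can be keyed on something else. In the output shown in the question, two different "Scheduled Meeting Times" tables carry the same summary text, so keying on summary alone would drop one of them. What follows is a minimal sketch, assuming that two tables count as duplicates only when their serialized markup is identical. `unique_by` is a hypothetical helper, not part of BeautifulSoup, and plain strings stand in for tags so the example runs without the library (Python 3).

```python
def unique_by(items, key):
    """Keep the first item for each distinct key(item), preserving input order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


# Plain strings stand in for serialized tables (assumption for illustration).
markup = ["<table>A</table>", "<table>B</table>", "<table>A</table>"]
deduped = unique_by(markup, key=str)
print(deduped)  # ['<table>A</table>', '<table>B</table>']
```

With real tags, something like `unique_by(soup.find_all("table", {"class": "datadisplaytable"}), key=str)` should compare the same markup that printing a tag shows, though that is worth verifying against the BeautifulSoup version in use.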