Extracting unique elements from a list returned by BeautifulSoup

Time: 2014-12-07 12:51:12

Tags: python python-3.x beautifulsoup

I have a list of countries scraped from this website, with the following values:

(Note: this is the output of iterating over the elements of all_countries.)

<a  data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
<a  data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
<a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a  data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
<a  data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
<a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a  data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a  data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
<a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>

What I want is to get only the unique countries.

This is what I tried:

all_countries = countries.select('div#country-box ul li a')

for index,value in enumerate(all_countries):
    print(value)
    all_countries[index] = value.text

all_countries = set(all_countries)
all_countries = list(all_countries)

for index,value in enumerate(all_countries):
    print(value)

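Well, I now have the unique elements, but this does not preserve the order in which the countries appear in the MultiSelectList on the website. I also need the values of the data-id and href attributes, as well as the text of each a tag, for later use in my script.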

If I do this instead:

all_countries = countries.select('div#country-box ul li a')
all_countries = set(all_countries)

all_countries = list(all_countries)

Would that be a good approach?

2 answers:

Answer 0: (score: 1)

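Use a set to keep track of the data-id values already seen, and yield a link only the first time its data-id appears. This keeps the original order and leaves each tag intact, so data-id, href and the text remain available.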

from bs4 import BeautifulSoup


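def iter_uniq_link(all_countries):
    """Yield each link the first time its data-id is seen, preserving order."""
    seen = set()  # data-id values already yielded
    for c in all_countries:
        data_id = c.get('data-id')
        if data_id not in seen:
            seen.add(data_id)
            yield c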

Usage:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <body>
...     <div id="country-box">
...         <ul>
...             <li>
...                 <a  data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
...                 <a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
...                 <a  data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
...                 <a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
...                 <a  data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
...                 <a  data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
...                 <a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
...                 <a  data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
...                 <a  data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
...                 <a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
...             </li>
...         </ul>
...     </div>
... </body>
... ''')
>>> all_countries = soup.select('div#country-box ul li a')
>>> list(iter_uniq_link(all_countries))
[<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU" select="">Australia</a>,
 <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>,
 <a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>,
 <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>,
 <a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>,
 <a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>,
 <a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>]
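Since the question also asks for the data-id, the href and the link text, the same seen-set idea can yield those three values directly. The following is only a sketch: `iter_uniq_fields` is a name made up for this example, and `FakeLink` is a hypothetical stand-in that mimics just the `get()` and `get_text()` methods of a bs4 Tag so that the snippet runs without any HTML.

```python
class FakeLink:
    """Hypothetical stand-in for a bs4 Tag: only get() and get_text() are mimicked."""

    def __init__(self, data_id, href, text):
        self._attrs = {'data-id': data_id, 'href': href}
        self._text = text

    def get(self, name, default=None):
        return self._attrs.get(name, default)

    def get_text(self):
        return self._text


def iter_uniq_fields(links):
    """Yield (data-id, href, text) for the first link seen with each data-id."""
    seen = set()  # data-id values already yielded
    for link in links:
        data_id = link.get('data-id')
        if data_id in seen:
            continue  # a later duplicate: skip it
        seen.add(data_id)
        yield data_id, link.get('href'), link.get_text()


# Sample data (an assumption for illustration, shortened from the question)
links = [
    FakeLink('AU', 'http://www.wotif.com/AU', 'Australia'),
    FakeLink('NZ', 'http://www.wotif.com/NZ', 'New Zealand'),
    FakeLink('TH', 'http://www.wotif.com/TH', 'Thailand'),
    FakeLink('AU', 'http://www.wotif.com/AU', 'Australia'),
    FakeLink('NZ', 'http://www.wotif.com/NZ', 'New Zealand'),
]

for data_id, href, text in iter_uniq_fields(links):
    print(data_id, href, text)
```

With real results from soup.select(...), the generator should be usable on all_countries directly, because Tag objects provide both get() and get_text().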

Answer 1: (score: 0)

One possible way to maintain both order and uniqueness is to use an OrderedDict. Add each unique data-id value as a key to the OrderedDict.

https://docs.python.org/3.3/library/collections.html#collections.OrderedDict

Keys added to such a dictionary keep their insertion order when you iterate over it (for example, using .keys()).
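A minimal sketch of the OrderedDict approach might look like this. The function name `unique_by_data_id` and the sample data are assumptions made for illustration; plain dicts stand in for the Tag objects here, which works because both expose a `get()` method.

```python
from collections import OrderedDict


def unique_by_data_id(links):
    """Return the first link for each data-id, in order of first appearance."""
    unique = OrderedDict()
    for link in links:
        # setdefault stores the link only if the key is new, so the first
        # occurrence wins and later duplicates are ignored
        unique.setdefault(link.get('data-id'), link)
    return list(unique.values())


# Plain dicts as hypothetical stand-ins for the <a> tags
sample = [
    {'data-id': 'AU', 'href': 'http://www.wotif.com/AU'},
    {'data-id': 'NZ', 'href': 'http://www.wotif.com/NZ'},
    {'data-id': 'ID', 'href': 'http://www.wotif.com/ID'},
    {'data-id': 'TH', 'href': 'http://www.wotif.com/TH'},
    {'data-id': 'AU', 'href': 'http://www.wotif.com/AU'},
    {'data-id': 'NZ', 'href': 'http://www.wotif.com/NZ'},
]

result = unique_by_data_id(sample)
print([link['data-id'] for link in result])  # ['AU', 'NZ', 'ID', 'TH']
```

On Python 3.7 and later the built-in dict also preserves insertion order, so a plain dict would behave the same way in this sketch.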