I have a list of countries from this website, with the following values
(note: this is the output from iterating over the elements of all_countries):
<a data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
<a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
<a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
What I want is to get only the unique countries.
This is what I tried:
all_countries = countries.select('div#country-box ul li a')
for index,value in enumerate(all_countries):
    print(value)
    all_countries[index] = value.text
all_countries = set(all_countries)
all_countries = list(all_countries)
for index,value in enumerate(all_countries):
    print(value)
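OK, I now have the unique elements, but the order in which the countries appear in the site's MultiSelectList is lost. I also need the values of the data-id and href attributes, along with the text of each a tag, for later use in my script.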
If I do this instead:
all_countries = countries.select('div#country-box ul li a')
all_countries = set(all_countries)
all_countries = list(all_countries)
Would this be a good approach?
Answer 0 (score: 1)
Use a set to keep track of the data-id values that have already been seen.
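from bs4 import BeautifulSoup

def iter_uniq_link(all_countries):
    """Yield each link once, in document order, keyed by its data-id."""
    seen = set()
    for c in all_countries:
        data_id = c.get('data-id')
        if data_id not in seen:
            # First time this data-id appears: remember it and emit the tag.
            seen.add(data_id)
            yield c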
Usage:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <body>
... <div id="country-box">
... <ul>
... <li>
... <a data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
... <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
... <a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
... <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
... <a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
... <a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
... <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
... <a data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
... <a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
... <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
... </li>
... </ul>
... </div>
... </body>
... ''')
>>> all_countries = soup.select('div#country-box ul li a')
>>> list(iter_uniq_link(all_countries))
[<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU" select="">Australia</a>,
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>,
<a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>,
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>,
<a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>,
<a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>,
<a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>]
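Answer 1 (score: 0)
One possible way to keep both the order and the uniqueness is to use an OrderedDict. Add each unique data-id value as a key to the OrderedDict.
https://docs.python.org/3.3/library/collections.html#collections.OrderedDict
Keys added to such a dictionary keep their insertion order when you iterate over it (for example, with .keys()).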
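A minimal sketch of the OrderedDict idea, under a few assumptions: the helper name unique_by_data_id is hypothetical, and plain dicts stand in for the BeautifulSoup tags so the example runs on its own. Both kinds of object support .get('data-id'), so the same function should also accept the result of soup.select(...). Storing the whole link as the value (rather than only the key) keeps href and the other attributes available afterwards.

```python
from collections import OrderedDict


def unique_by_data_id(links):
    """Return links de-duplicated by data-id; first occurrence wins, order is kept."""
    unique = OrderedDict()
    for link in links:
        # setdefault stores the link only if this data-id has not been seen yet.
        unique.setdefault(link.get('data-id'), link)
    return list(unique.values())


# Stand-in data (an assumption): plain dicts instead of parsed <a> tags.
sample = [
    {'data-id': 'AU', 'href': 'http://www.wotif.com/AU'},
    {'data-id': 'NZ', 'href': 'http://www.wotif.com/NZ'},
    {'data-id': 'ID', 'href': 'http://www.wotif.com/ID'},
    {'data-id': 'TH', 'href': 'http://www.wotif.com/TH'},
    {'data-id': 'TH', 'href': 'http://www.wotif.com/TH'},
    {'data-id': 'AU', 'href': 'http://www.wotif.com/AU'},
    {'data-id': 'NZ', 'href': 'http://www.wotif.com/NZ'},
]

result = unique_by_data_id(sample)
print([link['data-id'] for link in result])  # ['AU', 'NZ', 'ID', 'TH']
```

On Python 3.7 and later a regular dict also preserves insertion order, so a plain dict would likely work the same way here.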