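After finding all the <ul> elements, I want to go a step further and extract the text and href of each link. The particular problem with this HTML is that I need most, but not all, of the <li> items on the page. I see that find_all() returns a list-like object, which does not let me navigate further the way a soup object does.
For example, in the snippet below, with the goal of building a {'cityName': 'href'} dictionary, I tried: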
city_list = soup.find_all('ul', {'class': ''})
city_dict = {}
for city in city_list:
    city_dict[city.text] = city['href']
Here is a minimal HTML example:
<h4>Alabama</h4>
<ul>
<li><a href="https://auburn.craigslist.org/">auburn</a></li>
<li><a href="https://bham.craigslist.org/">birmingham</a></li>
<li><a href="https://tuscaloosa.craigslist.org/">tuscaloosa</a></li>
</ul>
<h4>Alaska</h4>
<ul>
<li><a href="https://anchorage.craigslist.org/">anchorage / mat-su</a></li>
<li><a href="https://juneau.craigslist.org/">southeast alaska</a></li>
</ul>
<h4>Arizona</h4>
<ul>
<li><a href="https://flagstaff.craigslist.org/">flagstaff / sedona</a></li>
<li><a href="https://yuma.craigslist.org/">yuma</a></li>
</ul>
<ul>
<li><a href="https://www.craigslist.org/about/help/">help</a></li>
<li><a href="https://www.craigslist.org/about/scams">safety</a></li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
Can I basically call find_all() on the ul elements first, and then drill down to the li items I'm interested in?
Answer 0 (score: 0)
You probably need something like this:
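city_dict = {}
for ul in soup.find_all('ul', {'class': ''}):
    # The nearest preceding <h4> sibling holds the state name
    state_name = ul.find_previous_sibling('h4').text
    print(state_name)
    # Each <ul> is a Tag, so it can be searched again
    for link in ul.find_all('a'):
        print(link['href'])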
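The same idea can be turned around to start from each <h4> and take the <ul> that follows it, which also keeps the trailing help/safety list out of the result. What follows is only a minimal sketch, assuming every state heading is followed by its own <ul> as in the sample HTML; the name cities_by_state and the trimmed html string are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question (illustrative subset)
html = """
<h4>Alabama</h4>
<ul>
<li><a href="https://auburn.craigslist.org/">auburn</a></li>
<li><a href="https://bham.craigslist.org/">birmingham</a></li>
</ul>
<h4>Alaska</h4>
<ul>
<li><a href="https://juneau.craigslist.org/">southeast alaska</a></li>
</ul>
<ul>
<li><a href="https://www.craigslist.org/about/help/">help</a></li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

cities_by_state = {}  # hypothetical name: {state: {city: href}}
for h4 in soup.find_all("h4"):
    # Every item returned by find_all() is a Tag, so it can be navigated further
    ul = h4.find_next_sibling("ul")
    if ul is None:
        continue
    cities_by_state[h4.get_text(strip=True)] = {
        a.get_text(strip=True): a["href"]
        for a in ul.find_all("a", href=True)
    }

print(cities_by_state)
```

Grouping by state is optional; flattening the inner dictionaries gives the {'cityName': 'href'} shape asked for in the question.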
Answer 1 (score: 0)
Try this, thank me later :)
list_items = soup.find_all('ul', {'class': ''})
list_of_dicts = []
for item in list_items:
    for i in item.find_all('li'):
        # Skip <li> items that contain no link, such as "desktop"
        if i.a is not None:
            new_dict = {i.text: i.a.get('href')}
            list_of_dicts.append(new_dict)
Answer 2 (score: 0)
city_dict = {}
for li in soup.find_all('li'):
    city_name = li.text
    for link in li.find_all('a'):
        city_dict[city_name] = link['href']
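Note that looping over every <li> also picks up the help and safety links at the bottom of the page. One possible way to leave them out is a CSS selector that only matches a <ul> coming directly after an <h4>. This is a sketch, assuming the page layout matches the sample HTML and that the installed Beautiful Soup version supports select() with sibling combinators; the trimmed html string is illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question (illustrative subset)
html = """
<h4>Arizona</h4>
<ul>
<li><a href="https://flagstaff.craigslist.org/">flagstaff / sedona</a></li>
<li><a href="https://yuma.craigslist.org/">yuma</a></li>
</ul>
<ul>
<li><a href="https://www.craigslist.org/about/help/">help</a></li>
<li><a href="https://www.craigslist.org/about/scams">safety</a></li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "h4 + ul" matches only a <ul> whose previous element sibling is an <h4>,
# so the footer list (which follows another <ul>) is not selected.
city_dict = {
    a.get_text(strip=True): a["href"]
    for a in soup.select("h4 + ul > li > a[href]")
}

print(city_dict)
```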