Question

I'm trying to use Python to get every link from an unordered list. How would I extract the href link from each list element (i.e., extract href="al/bessemer/4921-promenade-parkway")?

from urllib.request import urlopen
from bs4 import BeautifulSoup

uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
soup.find_all('ul', class_='Directory-listLinks')

This returns:

[<ul class="Directory-listLinks"><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(3)" data-ya-track="todirectory" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/cullman/1230-cullman-shopping-ctr-nw"><span class="Directory-listLinkText">Cullman</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/daphne/6850-13-highway-90"><span class="Directory-listLinkText">Daphne</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/decatur/1241-pointe-mallard-parkway"><span class="Directory-listLinkText">Decatur</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/dothan/3500-ross-clark-cir"><span class="Directory-listLinkText">Dothan</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/florence/390-cox-creek-parkway"><span class="Directory-listLinkText">Florence</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/foley/2528-s-mckenzie-street"><span class="Directory-listLinkText">Foley</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/fultondale/3453-lowery-parkway"><span class="Directory-listLinkText">Fultondale</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" 
href="al/gadsden/526-meighan-blvd-east"><span class="Directory-listLinkText">Gadsden</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(2)" data-ya-track="todirectory" href="al/huntsville"><span class="Directory-listLinkText">Huntsville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/montgomery/7670-east-chase-parkway"><span class="Directory-listLinkText">Montgomery</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/oxford/50-commons-way"><span class="Directory-listLinkText">Oxford</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/prattville/1472-cotton-exchange"><span class="Directory-listLinkText">Prattville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/tuscaloosa/1451-dr-edward-hillard-drive"><span class="Directory-listLinkText">Tuscaloosa</span></a></li></ul>]

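It returns a list with a single element, with everything at one index. I'd like to know how to get a separate item for each list entry and then pull the href link out of each one.

Thanks!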

Answer 1

Try this solution using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
uri = 'https://locations.fivebelow.com/al'
html = req.get(uri)
doc = SimplifiedDoc(html)
lstA = doc.getElementByClass('Directory-listLinks').listA(url=uri)
print([a.url for a in lstA])

Result:

[u'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street', u'https://locations.fivebelow.com/al/oxford/50-commons-way', u'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway', u'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange', u'https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway', u'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive', u'https://locations.fivebelow.com/al/daphne/6850-13-highway-90', u'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway', u'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir', u'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway', u'https://locations.fivebelow.com/al/huntsville', u'https://locations.fivebelow.com/al/birmingham', u'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway', u'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw', u'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east']

Answer 2

Here is my solution with bs4 and urllib.request:

from bs4 import BeautifulSoup
from urllib.request import urlopen

uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
# Each <li> wraps a single <a>; read its href attribute directly instead of
# slicing the tag's string form at a fixed offset, which breaks as soon as an
# earlier attribute (e.g. data-count="(10)") changes length.
li_list = soup.find('ul', class_='Directory-listLinks').find_all('li')
urls = []
for li in li_list:
    urls.append("https://locations.fivebelow.com/" + li.a['href'])

print(urls)

Result:

['https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway', 
'https://locations.fivebelow.com/al/birmingham', 
'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw', 
'https://locations.fivebelow.com/al/daphne/6850-13-highway-90', 
'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway', 
'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir', 
'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway', 
'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street', 
'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway', 
'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east', 
'https://locations.fivebelow.com/al/huntsville', 
'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway', 
'https://locations.fivebelow.com/al/oxford/50-commons-way', 
'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange', 
'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive']
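
As a further sketch (not taken from either answer), the same links can be pulled with a CSS selector, and the relative hrefs can be resolved with urljoin from the standard library rather than by prepending the host by hand. To keep it runnable offline, this parses a trimmed copy of the markup from the question instead of fetching the page, and it uses the built-in 'html.parser' in place of 'lxml'. The names SAMPLE_HTML, BASE_URL and extract_links are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (two of the fifteen items).
SAMPLE_HTML = """
<ul class="Directory-listLinks">
  <li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li>
  <li class="Directory-listItem"><a class="Directory-listLink" data-count="(3)" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li>
</ul>
"""

# Address of the page the hrefs are relative to (same URL as in the question).
BASE_URL = "https://locations.fivebelow.com/al"


def extract_links(html, base_url):
    """Return absolute URLs for every anchor inside the directory list."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: <a> tags that carry an href, nested under the target <ul>.
    anchors = soup.select("ul.Directory-listLinks a[href]")
    # urljoin resolves each relative href against the page address.
    return [urljoin(base_url, a["href"]) for a in anchors]


links = extract_links(SAMPLE_HTML, BASE_URL)
for link in links:
    print(link)
```

Against the live page, the same function should work on the fetched document, e.g. extract_links(urlopen(BASE_URL).read(), BASE_URL), assuming the site still serves the markup shown in the question.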

Extract all hrefs from a ul using Beautiful Soup

2 answers: