I am trying to use Python to get every link from an unordered list. How would I extract the href link from each list element (i.e., extract href="al/bessemer/4921-promenade-parkway")?
from urllib.request import urlopen
from bs4 import BeautifulSoup

uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
soup.find_all('ul', class_='Directory-listLinks')
This returns:
[<ul class="Directory-listLinks"><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(3)" data-ya-track="todirectory" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/cullman/1230-cullman-shopping-ctr-nw"><span class="Directory-listLinkText">Cullman</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/daphne/6850-13-highway-90"><span class="Directory-listLinkText">Daphne</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/decatur/1241-pointe-mallard-parkway"><span class="Directory-listLinkText">Decatur</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/dothan/3500-ross-clark-cir"><span class="Directory-listLinkText">Dothan</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/florence/390-cox-creek-parkway"><span class="Directory-listLinkText">Florence</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/foley/2528-s-mckenzie-street"><span class="Directory-listLinkText">Foley</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/fultondale/3453-lowery-parkway"><span class="Directory-listLinkText">Fultondale</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" 
href="al/gadsden/526-meighan-blvd-east"><span class="Directory-listLinkText">Gadsden</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(2)" data-ya-track="todirectory" href="al/huntsville"><span class="Directory-listLinkText">Huntsville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/montgomery/7670-east-chase-parkway"><span class="Directory-listLinkText">Montgomery</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/oxford/50-commons-way"><span class="Directory-listLinkText">Oxford</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/prattville/1472-cotton-exchange"><span class="Directory-listLinkText">Prattville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/tuscaloosa/1451-dr-edward-hillard-drive"><span class="Directory-listLinkText">Tuscaloosa</span></a></li></ul>]
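It returns a list containing a single element, with every list item inside that one index. I would like to know how to get each list item separately and then pull the href link out of each one.
Thanks!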
Answer 0 (score: 0)
Try a solution using SimplifiedDoc.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
uri = 'https://locations.fivebelow.com/al'
html = req.get(uri)
doc = SimplifiedDoc(html)
lstA = doc.getElementByClass('Directory-listLinks').listA(url=uri)
print([a.url for a in lstA])
Result:
[u'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street', u'https://locations.fivebelow.com/al/oxford/50-commons-way', u'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway', u'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange', u'https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway', u'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive', u'https://locations.fivebelow.com/al/daphne/6850-13-highway-90', u'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway', u'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir', u'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway', u'https://locations.fivebelow.com/al/huntsville', u'https://locations.fivebelow.com/al/birmingham', u'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway', u'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw', u'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east']
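The hrefs in the page are relative (for example al/birmingham), while the result above contains absolute URLs, so they were presumably resolved against uri. As a minimal sketch of that resolution step using only the standard library (the two hrefs are copied from the question's output; the variable names are illustrative):

```python
from urllib.parse import urljoin

# Base page the relative links were found on (taken from the question).
base = "https://locations.fivebelow.com/al"

# Two relative hrefs copied from the question's HTML output.
relative_hrefs = ["al/bessemer/4921-promenade-parkway", "al/birmingham"]

# urljoin resolves each relative href against the base URL,
# the same way a browser would when following the link.
absolute = [urljoin(base, href) for href in relative_hrefs]
print(absolute)
```

Because the base path /al has no trailing slash, urljoin replaces its last segment, so al/birmingham resolves to https://locations.fivebelow.com/al/birmingham rather than .../al/al/birmingham.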
Answer 1 (score: 0)
Here is my solution with bs4 and urllib.request.
from bs4 import BeautifulSoup
from urllib.request import urlopen
uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
li_list = (soup.find('ul', class_='Directory-listLinks')).find_all("li")
urls = []
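for li in li_list:
    # Skip the first 105 characters of the <li> markup so that the first
    # quoted string is the href value; this breaks if the markup changes.
    urls.append("https://locations.fivebelow.com/" + str(li)[105:].split('"')[1])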
print(urls)
Result:
['https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway',
'https://locations.fivebelow.com/al/birmingham',
'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw',
'https://locations.fivebelow.com/al/daphne/6850-13-highway-90',
'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway',
'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir',
'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway',
'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street',
'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway',
'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east',
'https://locations.fivebelow.com/al/huntsville',
'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway',
'https://locations.fivebelow.com/al/oxford/50-commons-way',
'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange',
'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive']
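Slicing the markup at a fixed character offset works for this page but is easy to break. A less fragile alternative, offered as a sketch rather than as the answer's own method, is to read the href attribute from each a tag directly. The snippet below assumes a trimmed copy of the question's HTML (SAMPLE_HTML) in place of a live request, and uses the built-in html.parser so it runs without lxml; extract_links is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the markup from the question, used here
# instead of a live request so the sketch runs offline.
SAMPLE_HTML = """
<ul class="Directory-listLinks">
  <li class="Directory-listItem"><a class="Directory-listLink" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li>
  <li class="Directory-listItem"><a class="Directory-listLink" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li>
</ul>
"""

BASE_URL = "https://locations.fivebelow.com/al"


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> inside the directory list."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: anchors that have an href, anywhere under the target <ul>.
    anchors = soup.select("ul.Directory-listLinks a[href]")
    # a["href"] reads the attribute value; urljoin makes it absolute.
    return [urljoin(base_url, a["href"]) for a in anchors]


urls = extract_links(SAMPLE_HTML, BASE_URL)
print(urls)
```

To run it against the live page, pass urlopen(uri) (or its decoded contents) as html instead of SAMPLE_HTML; the links come back in document order.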