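I've written a script in Python that uses a get_links() function to scrape names and the links associated with them from a website's landing page. I then created another function, get_info(), which visits a second page (using the links obtained by the first function) and scrapes the phone numbers from there.
If my goal were only to parse those two items I wouldn't need the second function at all, because both are already available on the landing page.
However, I want my parser to print the names (carried over from the first function) together with the phone numbers inside the second function. Most importantly, I don't want to drop the for loop defined in the second function. If the for loop weren't in the second function there would be no issue: without the loop I get the desired output.
This is my script so far: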
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://potguide.com/alaska/marijuana-dispensaries/"

def get_links(link):
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'
    r = session.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("#StateStores .basic-listing"):
        name = items.select_one("h4 a").text
        namelink = urljoin(link,items.select_one("h4 a").get("href")) ##making it a fully qualified url
        get_info(session,name,namelink) ##passing session in order to reuse it

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"): ##if I did not use for loop I could get the output as desired.
        try:
            phone = items.select_one("a[href^='tel:']").text
        except:
            phone = ""
        print(title,phone)

if __name__ == '__main__':
    get_links(url)
Output I'm getting:
AK Frost
AK Frost
AK Frost
AK Frost
AK Frost
AK Frost (907) 563-9333
AK Frost
AK Frost
AK Frost (907) 563-9333
AK Frost
AK Fuzzy Budz
AK Fuzzy Budz (907) 644-2838
AK Fuzzy Budz
AK Fuzzy Budz
AK Fuzzy Budz (907) 644-2838
My expected output:
AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
Answer 0 (score: 6)
If the goal is just to get the expected output, then this should work:
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):
        try:
            # select_one() returns None when nothing matches, so .text raises AttributeError
            phone = items.select_one("a[href^='tel:']").text
        except AttributeError:
            # skip item and continue
            continue
        else:
            # exception wasn't raised, you have the phone
            print(title,phone)
            break
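To see why the continue/break pair prints each name only once, here is a minimal offline sketch of the same control flow. It is not part of the original answer: first_phone and the SimpleNamespace objects are hypothetical stand-ins for what select_one() returns per list (None when a list has no tel: link), so no network access is needed.

```python
from types import SimpleNamespace

def first_phone(matches):
    """Return the text of the first non-None match, or '' if there is none."""
    for match in matches:
        try:
            phone = match.text      # raises AttributeError when match is None
        except AttributeError:
            continue                # no phone in this list, try the next one
        else:
            return phone            # first hit wins, stop looking
    return ""

# Hypothetical per-list results, mimicking the repeated numbers in the question
matches = [None, None, SimpleNamespace(text="(907) 563-9333"),
           None, SimpleNamespace(text="(907) 563-9333")]
print("AK Frost", first_phone(matches))
```

Returning from the helper plays the role of break in the answer: once a number is found, the remaining lists are never inspected.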
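Answer 1 (score: 4)
In my opinion, you should take advantage of the underlying JavaScript dictionary, which already holds your data (and much more) in a structured format.
You can use yaml to convert the JavaScript dictionary into a Python dict object. You can then easily access dictionary fields such as id, name, address, city, state, etc.
Here is a working example: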
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import yaml

url = "https://potguide.com/alaska/marijuana-dispensaries/"

def get_links(link):
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'
    r = session.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("#StateStores .basic-listing"):
        name = items.select_one("h4 a").text
        namelink = urljoin(link,items.select_one("h4 a").get("href"))
        get_info(session, name, namelink)

def get_info(session, title, url):
    # reuse the session so the User-Agent header is sent here as well
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    # first <script> whose source mentions mapOptions, or None
    script = next((i for i in map(str, soup.find_all("script", type="text/javascript"))
                   if 'mapOptions' in i), None)
    if script:
        # cut out the object literal assigned to __mapOptions
        js_dict = script.split('__mapOptions = ')[1].split(';\n')[0]
        # safe_load: recent PyYAML versions require an explicit loader for yaml.load()
        d = yaml.safe_load(js_dict)
        print(title, d['mapStore']['phone'])

get_links(url)
Result:
AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
AK Joint (907) 522-5222
AK Slow Burn (907) 868-1450
Alaska Fireweed (907) 258-9333
...
Bad Gramm3r (907) 357-0420
Green Degree (907) 376-3155
Green Jar (907) 631-3800
Rosebuds Shatter House (907) 376-9334
Happy Cannabis (907) 305-0292
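The extraction step in this answer can also be tried offline with only the standard library. The sketch below is an illustration, not the answer's method: the HTML fragment and the extract_map_options helper are made up, and it assumes the object assigned to __mapOptions happens to be strict JSON (double-quoted keys and strings). If the real literal uses unquoted keys, json.loads will reject it, which is one reason to use yaml as the answer does.

```python
import json
import re

# Hypothetical page fragment; the script on the real site may look different.
html = '''
<script type="text/javascript">
var __mapOptions = {"mapStore": {"id": 1, "name": "AK Frost", "phone": "(907) 563-9333"}};
</script>
'''

def extract_map_options(page_text):
    """Return the object assigned to __mapOptions as a dict, or None if absent."""
    # Non-greedy match up to the first "};" after the assignment
    match = re.search(r"__mapOptions\s*=\s*(\{.*?\});", page_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

data = extract_map_options(html)
print(data["mapStore"]["name"], data["mapStore"]["phone"])
```

The regular expression stops at the first "};", so a string value containing that sequence would truncate the match; for such pages a real parser is the safer choice.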
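Answer 2 (score: 3)
I think selecting ul.list-unstyled on the subpage is too broad: it matches a lot of content that isn't really what you want.
If you really only want the phone number, you can search directly for a tags whose href starts with "tel:". The remaining problem is that these pages list several numbers this way, usually two, one of which is not visible. The visible one always seems to sit under div.col-md-3. I tried this: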
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for a_phone in soup.select("div.col-md-3 a[href^='tel:']"):
        print(title, a_phone.text)
And got the following result:
AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
AK Joint (907) 522-5222
AK Slow Burn (907) 868-1450
Alaska Fireweed (907) 258-9333
Alaskabuds (907) 334-6420
Alaskan Leaf (907) 770-0262
Alaska's Green Light District (907) 644-2839
AM Delight (907) 229-1730
Arctic Herbery (907) 222-1466
Cannabaska (907) 375-9333
Catalyst Cannabis Company (907) 344-0668
Dankorage (907) 279-3265
Enlighten Alaska (907) 290-8559
Great Northern Cannabis (907) 929-9333
Hillside Natural Wellness (907) 868-8639
Hollyweed 907 (907) 929-3331
Raspberry Roots (907) 522-2450
Satori (907) 222-5420
The House of Green (907) 929-3105
Uncle Herb's (907) 561-4372
The Green Spot (907) 354-7044
Denali's Cannabis Cache (907) 683-2633
GOOD (907) 452-5463
Goodsinse (907) 347-7689
Grass Station 49 (907) 374-4420
Green Life Supply (907) 374-4769
One Hit Wonder (844) 420-1448
Pakalolo Supply Company (907) 479-9000
Rebel Roots (907) 455-4055
True Dank (907) 451-4516
The Herbal Cache (907) 783-0420
Denali 420 Recreationals (907) 892-9333
Glacier Valley Shoppe (907) 419-7943
Green Elephant (907) 290-8400
Rainforest Farms (907) 209-2670
The Fireweed Factory (907) 957-2670
Red Run Cannabis Company (907) 283-0800
Cannabis Corner (907) 225-4420
Rainforest Cannabis (907) 247-9333
The Stoney Moose (907) 617-8973
Chena Cannabis (907) 488-0489
The 420 (907) 772-3673
Green Leaf (907) 623-0332
Weed Dudes (907) 623-0605
Remedy Shoppe (907) 983-3345
Fat Tops (907) 953-2470
High Bush Buds (907) 953-9393
Pine Street Cannabis Company (907) 260-3330
Permafrost Distributors (907) 260-7584
Hilltop Premium Green (907) 745-4425
The High Expedition Company (907) 733-0911
Herbal Outfitters (907) 835-4201
Bad Gramm3r (907) 357-0420
Green Degree (907) 376-3155
Green Jar (907) 631-3800
Rosebuds Shatter House (907) 376-9334
Happy Cannabis (907) 305-0292
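The effect of narrowing the selector, as this answer describes, can be checked offline on a small fragment. The markup below is hypothetical, loosely modelled on the description (several ul.list-unstyled lists, with the number duplicated outside div.col-md-3); the real pages may be structured differently. It uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three lists share the class, only one is inside div.col-md-3
html = """
<ul class="list-unstyled"><li>Opening hours</li></ul>
<ul class="list-unstyled hidden"><li><a href="tel:9075639333">(907) 563-9333</a></li></ul>
<div class="col-md-3">
  <ul class="list-unstyled"><li><a href="tel:9075639333">(907) 563-9333</a></li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Broad selector: every list with the class, with or without a phone link
broad = soup.select("ul.list-unstyled")
# Narrow selector: only tel: links that are descendants of div.col-md-3
narrow = soup.select("div.col-md-3 a[href^='tel:']")

print(len(broad), "lists matched by the broad selector")
print([a.text for a in narrow])
```

Looping over the broad matches is what produced the repeated lines in the question; the narrow selector yields a single link per page in this fragment.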
Answer 3 (score: 3)
You've already got good enough answers, but you can try this as well:
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):
        if len(items.select("a[href^='tel:']")):
            phone = items.select("a[href^='tel:']")[0].text
            break
    else:
        # for/else: runs only when the loop ends without hitting break
        phone = "N/A"
    print(title, phone)
Or as a kind of one-liner :)
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    phone = ([items.select("a[href^='tel:']")[0].text for items in soup.select("ul.list-unstyled")
              if len(items.select("a[href^='tel:']"))] + ["N/A"])[0]
    print(title, phone)
Note that "N/A" is assigned if no phone number is found (e.g. Northern Lights Indoor Gardens N/A).
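As a variation on the one-liner (a different technique, not the answer's own code), the "first match or default" idea can also be written with next() and a default value, which stops at the first hit instead of building the whole list. In this sketch first_or_default is a hypothetical helper, and the nested lists stand in for the result of items.select("a[href^='tel:']") on each list.

```python
def first_or_default(groups, default="N/A"):
    """Return the first element of the first non-empty group, else default."""
    # The generator is consumed lazily, so later groups are not inspected
    return next((group[0] for group in groups if group), default)

# Hypothetical per-list matches: empty lists mean "no tel: link in this list"
with_phone = [[], ["(907) 563-9333"], ["(907) 563-9333"]]
without_phone = [[], []]

print("AK Frost", first_or_default(with_phone))
print("Northern Lights Indoor Gardens", first_or_default(without_phone))
```

In get_info the groups would be produced by a generator over soup.select("ul.list-unstyled"), so each list is queried only until the first number turns up.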