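I am trying to scrape data from Yellow Pages. I have used this scraper successfully several times, but it recently stopped working. I noticed a recent change on the Yellow Pages website: they added a Sponsored Links table containing three results. Since this change, the only thing my scraper picks up is the advertisement beneath this Sponsored Links table. It does not retrieve any of the actual results.
Where am I going wrong?
I have included my code below. As an example, it searches for 7-Eleven locations in Wisconsin.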
import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")

placeHolder = []
for item in soup.select(".info"):
    try:
        name = item.select("[itemprop='name']")[0].text
    except Exception:
        name = ""
    try:
        streetAddress = item.select("[itemprop='streetAddress']")[0].text
    except Exception:
        streetAddress = ""
    try:
        addressLocality = item.select("[itemprop='addressLocality']")[0].text
    except Exception:
        addressLocality = ""
    try:
        addressRegion = item.select("[itemprop='addressRegion']")[0].text
    except Exception:
        addressRegion = ""
    try:
        postalCode = item.select("[itemprop='postalCode']")[0].text
    except Exception:
        postalCode = ""
    try:
        phone = item.select("[itemprop='telephone']")[0].text
    except Exception:
        phone = ""

    with open('yp-7-eleven-wi.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])
Answer 0 (score: 2)
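There are several issues in your existing script. You created a for loop that is supposed to traverse 19 different pages, but the parsing is confined to a single page, because only the last response is ever processed. The selectors you defined no longer match any elements on the page. Moreover, you repeat the try:except block many times, which makes your scraper look messy. You can define a custom function to get rid of the IndexError or AttributeError issue. Finally, you can use csv.DictWriter() to write the scraped items to a csv file.
Give it a go: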
import requests
import csv
from bs4 import BeautifulSoup

placeHolder = []

urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1,5)]

def get_text(item, path):
    # Return the text of the first match, or "" when the selector matches nothing
    elem = item.select_one(path)
    return elem.text if elem else ""

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Parse inside the page loop so every page is processed, not only the last one
    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item,"a.business-name span")
        d['streetAddress'] = get_text(item,".street-address")
        d['addressLocality'] = get_text(item,".locality")
        d['addressRegion'] = get_text(item,".locality + span")
        d['postalCode'] = get_text(item,".locality + span + span")
        d['phone'] = get_text(item,".phones")
        placeHolder.append(d)

# Write everything once, after all pages have been collected
with open("yellowpageInfo.csv","w",newline="") as outfile:
    writer = csv.DictWriter(outfile,['name','streetAddress','addressLocality','addressRegion','postalCode','phone'])
    writer.writeheader()
    for elem in placeHolder:
        writer.writerow(elem)
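To see what the csv.DictWriter() step produces without hitting the site, here is a minimal offline sketch. It writes to an in-memory buffer instead of a file; the helper name rows_to_csv and the FIELDS constant are made up for this illustration, and the second row is simply the sample listing with its phone removed, to show that a missing key becomes an empty cell.

```python
import csv
import io

# Column order for the output; same field names as in the script above.
FIELDS = ["name", "streetAddress", "addressLocality",
          "addressRegion", "postalCode", "phone"]

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, header row first."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    {"name": "7-Eleven", "streetAddress": "1624 W Wells St",
     "addressLocality": "Milwaukee", "addressRegion": "WI",
     "postalCode": "53233", "phone": "(414) 342-9710"},
    # Illustrative row without a phone: DictWriter fills the gap with "".
    {"name": "7-Eleven", "streetAddress": "1624 W Wells St",
     "addressLocality": "Milwaukee", "addressRegion": "WI",
     "postalCode": "53233"},
]

print(rows_to_csv(rows))
```

Because DictWriter fills missing keys with an empty string by default, a listing that lacks a field does not need special handling before it is written.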
Answer 1 (score: 1)
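Life of a scraper... the struggle is real!
When a website changes its layout, its elements and class names often change too. You want to inspect the updated page carefully and fix everything in your scraper that relies on hardcoded values tied to page elements, class names, and so on, since those may have changed.
A quick inspection of the page shows that the information you want to scrape now sits in a different structure: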
<div class="v-card">
  <div class="media-thumbnail"><a class="media-thumbnail-wrapper chain-img" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
      data-analytics='{"click_id":509}' data-impressed="1"><img class="lazy" alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
        data-original="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d" width="40"
        height="40" style="display: block;"><noscript><img alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
          width="40" height="40"></noscript></a></div>
  <div class="info">
    <h2 class="n">2. <a class="business-name" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
        data-analytics='{"target":"name","feature_click":""}' rel=""
        data-impressed="1"><span>7-Eleven</span></a></h2>
    <div class="info-section info-primary">
      <div class="ratings" data-israteable="true"></div>
      <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee, </span><span>WI</span> <span>53233</span></p>
      <div class="phones phone primary">(414) 342-9710</div>
    </div>
    <div class="info-section info-secondary">
      <div class="categories"><a href="/wi/convenience-stores" data-analytics='{"click_id":1171,"adclick":false,"listing_features":"category","events":""}'
          data-impressed="1">Convenience Stores</a></div>
      <div class="links"><a class="track-visit-website" href="https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836"
          rel="nofollow" target="_blank" data-analytics='{"click_id":6,"act":2,"dku":"https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836","FL":"url","target":"website","LOC":"https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836","adclick":true}'
          data-impressed="1">Website</a></div>
    </div>
    <div class="preferred-listing-features"></div>
    <div class="snippet">
      <p class="body"><span>From Business: At 7-Eleven, our doors are always open, and our friendly store teams
          are ready to serve you. Our fresh, fast and convenient hot foods appeal to any craving, so yo…</span></p>
    </div>
  </div>
</div>
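For example, for the address you now need the .street-address class instead of an itemprop-based selector such as [itemprop='streetAddress'], and so on.
For the nested locality case, where the state and ZIP code are unlabeled span elements following .locality, use the built-in selectors that mimic CSS-style selectors.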
try:
    locality = item.select(".locality")[0]
    addressLocality = locality.text
    # State and ZIP are the <span> siblings that follow .locality
    state_zip = locality.find_next_siblings("span")  # returns a list
    addressRegion = state_zip[0].text
    postalCode = state_zip[1].text
    # Might want to add some checks if the state or zip is missing, etc.
except Exception:
    addressLocality = addressRegion = postalCode = ""
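The same idea can be tried offline against the address fragment from the listing above. This is only a sketch: parse_address and text_of are hypothetical helper names, html.parser is used so that lxml is not required, and it assumes the state and ZIP always appear as the first and second span siblings after .locality, as they do in the sample markup.

```python
from bs4 import BeautifulSoup

# Address fragment copied from the sample listing above.
html = """
<div class="info">
  <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee, </span><span>WI</span> <span>53233</span></p>
  <div class="phones phone primary">(414) 342-9710</div>
</div>
"""

def parse_address(item):
    """Return (street, city, state, zip); missing parts become ""."""
    def text_of(tag):
        return tag.get_text(strip=True) if tag else ""

    street = text_of(item.select_one(".street-address"))
    locality = item.select_one(".locality")
    # State and ZIP carry no class, so reach them as siblings of .locality.
    siblings = locality.find_next_siblings("span") if locality else []
    city = text_of(locality).rstrip(",")  # drop the trailing comma
    state = text_of(siblings[0]) if len(siblings) > 0 else ""
    postal = text_of(siblings[1]) if len(siblings) > 1 else ""
    return street, city, state, postal

item = BeautifulSoup(html, "html.parser").select_one(".info")
print(parse_address(item))
```

Checking the length of the sibling list replaces the broad try/except, so a listing with no state or ZIP yields empty strings instead of raising.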
Summary:
Fix those hardcoded values so that they match the new class names, and you should be back in business.
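One way to make the next layout change easier to spot, sketched here under assumptions: keep the selectors in a single mapping and report which ones no longer match anything. The SELECTORS dict and the unmatched_selectors function are hypothetical names for this illustration; the selector strings are the class-based ones discussed in this thread, and the two HTML snippets are trimmed stand-ins for the new and old markup.

```python
from bs4 import BeautifulSoup

# Every hardcoded selector lives in one place (values taken from this thread).
SELECTORS = {
    "name": "a.business-name span",
    "streetAddress": ".street-address",
    "addressLocality": ".locality",
    "phone": ".phones",
}

def unmatched_selectors(html, selectors=SELECTORS):
    """Return the sorted field names whose selector matches nothing in html."""
    soup = BeautifulSoup(html, "html.parser")
    return sorted(field for field, css in selectors.items()
                  if soup.select_one(css) is None)

# Trimmed stand-in for the current markup.
new_layout = """
<div class="info">
  <h2 class="n"><a class="business-name" href="#"><span>7-Eleven</span></a></h2>
  <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee, </span></p>
  <div class="phones phone primary">(414) 342-9710</div>
</div>
"""

# Trimmed stand-in for the older, itemprop-based markup.
old_layout = '<div class="info"><span itemprop="name">7-Eleven</span></div>'

print(unmatched_selectors(new_layout))
print(unmatched_selectors(old_layout))
```

Running such a check on the first downloaded page before scraping the rest gives an early warning that the class names need updating.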