Question

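I am trying to scrape data from Yellow Pages. I have used this scraper successfully several times, but it recently stopped working. I noticed a recent change on the Yellow Pages website: they added a Sponsored Links table containing three results. Since this change, the only thing my scraper picks up is the advertisement beneath this Sponsored Links table. It does not retrieve any of the results.

Where am I going wrong?

I have included my code below. As an example, it searches for 7-Eleven locations in Wisconsin.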

import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
  res = requests.get(link)
  soup = BeautifulSoup(res.text, "lxml")

placeHolder = []
for item in soup.select(".info"):
  try:
    name = item.select("[itemprop='name']")[0].text
  except Exception:
    name = ""
  try:
    streetAddress = item.select("[itemprop='streetAddress']")[0].text
  except Exception:
    streetAddress = ""
  try:
    addressLocality = item.select("[itemprop='addressLocality']")[0].text
  except Exception:
    addressLocality = ""
  try:
    addressRegion = item.select("[itemprop='addressRegion']")[0].text
  except Exception:
    addressRegion = ""
  try:
    postalCode = item.select("[itemprop='postalCode']")[0].text
  except Exception:
    postalCode = ""
  try:
    phone = item.select("[itemprop='telephone']")[0].text
  except Exception:
    phone = ""

  with open('yp-7-eleven-wi.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])

Answer 1

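There are several issues in your existing script. You created a for loop that is supposed to traverse 19 different pages, but the parsing code sits outside that loop, so the content is limited to a single page. The selectors you defined no longer match any elements. You also repeat the try/except block many times, which makes your scraper look messy. You can define a custom function to get rid of the IndexError or AttributeError problem. Finally, you can use csv.DictWriter() to write the scraped items to a CSV file.

Give it a try: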
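To make the first point concrete, here is a minimal sketch of the loop structure only. The names scrape_pages, fake_fetch and fake_parse are made up for illustration; in the real scraper the fetch step would be requests.get(url).text and the parse step would be the BeautifulSoup selection code. The point is just that parsing has to happen inside the loop, otherwise every page except the last is thrown away.

```python
def scrape_pages(urls, fetch, parse):
    """Fetch every URL and parse each page before moving on to the next."""
    rows = []
    for url in urls:
        html = fetch(url)
        # Parsing sits inside the loop, so no page is overwritten unparsed.
        rows.extend(parse(html))
    return rows


# Hypothetical stand-ins for the network call and the HTML parsing.
fake_pages = {"page=1": "a,b", "page=2": "c", "page=3": ""}


def fake_fetch(url):
    return fake_pages[url]


def fake_parse(html):
    return [name for name in html.split(",") if name]


results = scrape_pages(list(fake_pages), fake_fetch, fake_parse)
print(results)
```

Passing the fetch and parse steps in as functions is only a convenience for trying the loop without hitting the site.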

import requests
import csv
from bs4 import BeautifulSoup

placeHolder = []

urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1,5)]

def get_text(item, path):
    # Return the text of the first match, or "" if the selector finds nothing.
    node = item.select_one(path)
    return node.text if node else ""

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Parse each page inside the loop so every page contributes rows.
    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item, "a.business-name span")
        d['streetAddress'] = get_text(item, ".street-address")
        d['addressLocality'] = get_text(item, ".locality")
        d['addressRegion'] = get_text(item, ".locality + span")
        d['postalCode'] = get_text(item, ".locality + span + span")
        d['phone'] = get_text(item, ".phones")
        placeHolder.append(d)

# Write everything once, after all pages have been collected.
with open("yellowpageInfo.csv", "w", newline="") as infile:
    writer = csv.DictWriter(infile, ['name','streetAddress','addressLocality','addressRegion','postalCode','phone'])
    writer.writeheader()
    for elem in placeHolder:
        writer.writerow(elem)
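If you want to see what csv.DictWriter() does with incomplete rows before pointing it at a real file, the following sketch writes to an in-memory buffer instead. The helper name rows_to_csv and the second sample row are invented for the example; restval and extrasaction are standard DictWriter parameters.

```python
import csv
import io

FIELDS = ["name", "streetAddress", "addressLocality",
          "addressRegion", "postalCode", "phone"]


def rows_to_csv(rows, fieldnames=FIELDS):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # restval fills columns a row is missing; extrasaction="ignore"
    # silently drops keys that are not in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [
    {"name": "7-Eleven", "streetAddress": "1624 W Wells St",
     "addressLocality": "Milwaukee", "addressRegion": "WI",
     "postalCode": "53233", "phone": "(414) 342-9710"},
    # Made-up row: most fields missing, plus one unknown key.
    {"name": "Listing without a phone", "rating": 5},
]

csv_text = rows_to_csv(sample)
print(csv_text)
```

Without extrasaction="ignore", a row containing a key outside the field list would raise a ValueError, which is sometimes what you want while debugging selectors.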

Answer 2

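A scraper's life... the struggle is real!

When a website changes its layout, element and class names often change with it. You want to inspect the update carefully and fix everything in your scraper that relies on hard-coded values tied to page elements, class names and so on, since those may have changed.

A quick inspection of the page shows that the information you want to scrape now sits in a different structure: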

<div class="v-card">
    <div class="media-thumbnail"><a class="media-thumbnail-wrapper chain-img" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
            data-analytics="{&quot;click_id&quot;:509}" data-impressed="1"><img class="lazy" alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                data-original="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d" width="40"
                height="40" style="display: block;"><noscript><img alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                    width="40" height="40"></noscript></a></div>
    <div class="info">
        <h2 class="n">2.&nbsp;<a class="business-name" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
                data-analytics="{&quot;target&quot;:&quot;name&quot;,&quot;feature_click&quot;:&quot;&quot;}" rel=""
                data-impressed="1"><span>7-Eleven</span></a></h2>
        <div class="info-section info-primary">
            <div class="ratings" data-israteable="true"></div>
            <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee,&nbsp;</span><span>WI</span>&nbsp;<span>53233</span></p>
            <div class="phones phone primary">(414) 342-9710</div>
        </div>
        <div class="info-section info-secondary">
            <div class="categories"><a href="/wi/convenience-stores" data-analytics="{&quot;click_id&quot;:1171,&quot;adclick&quot;:false,&quot;listing_features&quot;:&quot;category&quot;,&quot;events&quot;:&quot;&quot;}"
                    data-impressed="1">Convenience Stores</a></div>
            <div class="links"><a class="track-visit-website" href="https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836"
                    rel="nofollow" target="_blank" data-analytics="{&quot;click_id&quot;:6,&quot;act&quot;:2,&quot;dku&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;FL&quot;:&quot;url&quot;,&quot;target&quot;:&quot;website&quot;,&quot;LOC&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;adclick&quot;:true}"
                    data-impressed="1">Website</a></div>
        </div>
        <div class="preferred-listing-features"></div>
        <div class="snippet">
            <p class="body"><span>From Business: At 7-Eleven, our doors are always open, and our friendly store teams
                    are ready to serve you. Our fresh, fast and convenient hot foods appeal to any craving, so yo…</span></p>
        </div>
    </div>
</div>

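For example, for the street address you now need .street-address instead of the old itemprop attribute selector, and so on.

For the nested locality example, use the built-in selectors that mimic CSS-style selectors.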

try:
  locality = item.select(".locality")[0]
  addressLocality = locality.text
  # State and zip are unclassed <span> siblings that follow .locality
  state_zip = locality.find_next_siblings("span") # returns a list
  state = state_zip[0].text
  zip_code = state_zip[1].text
  # Might want to add some checks if the state or zip is missing, etc.
except Exception:
  addressLocality = ""
  state = ""
  zip_code = ""
Summary:

Fix these hard-coded values so they match the new class names and you should be back in business.
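One way to spot stale hard-coded values quickly is to keep the selectors in a single dictionary and count how many elements each one matches. This is only a sketch: the SELECTORS mapping, the count_matches helper and the trimmed SAMPLE markup are assumptions for illustration, with the sample based on the listing structure shown above.

```python
from bs4 import BeautifulSoup

# Selectors to check; "oldName" is the itemprop selector from the question.
SELECTORS = {
    "name": "a.business-name span",
    "streetAddress": ".street-address",
    "phone": ".phones",
    "oldName": "[itemprop='name']",
}

# Trimmed-down listing in the current page structure.
SAMPLE = (
    '<div class="info"><h2 class="n">'
    '<a class="business-name" href="#"><span>7-Eleven</span></a></h2>'
    '<p class="adr"><span class="street-address">1624 W Wells St</span></p>'
    '<div class="phones phone primary">(414) 342-9710</div></div>'
)


def count_matches(html, selectors):
    """Map each field name to the number of elements its selector matches."""
    soup = BeautifulSoup(html, "html.parser")
    return {field: len(soup.select(css)) for field, css in selectors.items()}


counts = count_matches(SAMPLE, SELECTORS)
# Any selector with zero matches is a candidate for updating.
stale = sorted(field for field, n in counts.items() if n == 0)
print(counts)
print(stale)
```

Running a check like this against a freshly downloaded page after a layout change points you at the selectors that need attention.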

Yellow Pages scraper in Python stopped working
